Compare commits

327 commits

Author SHA1 Message Date
d1c3eb681a docs(status): coordinator-FQDN pin applied + live on ubongo (2026-06-20)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 12:01:29 +02:00
1299eef6ea Merge feat/mesh-spof-resilience: accept mesh SPOF (R8) + coordinator DNS-resilience pin
Sub-project 3 of mesh-hardening. Accepts the single off-site coordinator as a documented availability SPOF (R8 + ADR-016 availability amendment) given the narrow blast radius: LAN, intra-cluster, and local-service traffic never traverse the mesh; only remote relayed mesh access breaks. Hardens the one real gap — a base mesh coordinator-FQDN /etc/hosts pin (base__mesh_coordinator_pin, wired for ubongo) so a local-DNS hiccup can't strand the mesh. No new infra; coordinator off-site backup deferred to ADR-022.

Whole-branch review: ready to merge with fixes (applied: anchored pin regexp, ADR-016 backup notes, verify comment).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 11:42:49 +02:00
0030b45bbd docs(adr-016): soften the second stale off-site-backup claim (R8 consistency)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 11:42:49 +02:00
a483f4e55c fix: address whole-branch review (anchor pin regexp, ADR-016 backup note, verify comment)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 11:41:19 +02:00
c09b7fe6a5 docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 11:34:21 +02:00
74e54b359b fix(base): confine /etc/hosts unsafe-write fallback to the Docker Molecule env
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 11:31:15 +02:00
f83d68d7a0 feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)
Adds base__mesh_coordinator_pin (default empty = no-op). When set + base__mesh_enabled,
a lineinfile task writes "<ip> <fqdn>" to /etc/hosts so a managed mesh host survives a
local-DNS hiccup (the 2026-06-18 incident class). FQDN derived from base__mesh_management_url
via regex_replace (no community.general). Gated on base__mesh_enabled | bool and pin length;
the coordinator host (askari/offsite_hosts) stays exempt. Production pin wired for ubongo
(77.42.120.136). Molecule dns_servers fix included (Docker/NetBird DNS incompatibility).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-20 11:22:40 +02:00
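The pin in f83d68d7a0 can be illustrated outside Ansible. The following is a minimal Python sketch, not the role's code: the role itself uses a lineinfile task and regex_replace, whereas this sketch swaps in urllib.parse to derive the FQDN. The function names, the example URL, and the example IP are all hypothetical. It shows the two properties the commit and its follow-up review fix rely on: the FQDN comes from the management URL, and the match is anchored so a re-run replaces the pin rather than appending a duplicate.

```python
import re
from urllib.parse import urlsplit


def coordinator_fqdn(management_url: str) -> str:
    """Return the host part of a management URL (scheme and port dropped)."""
    host = urlsplit(management_url).hostname
    if not host:
        raise ValueError(f"no host in {management_url!r}")
    return host


def pin_hosts_line(hosts_text: str, ip: str, fqdn: str) -> str:
    """Return hosts_text containing exactly one '<ip> <fqdn>' line for fqdn."""
    # Anchored at both ends: only a whole line of the form
    # "<address> <fqdn>" matches, so a longer name that merely ends
    # with the FQDN is left alone.
    pattern = re.compile(rf"^\S+\s+{re.escape(fqdn)}\s*$")
    wanted = f"{ip} {fqdn}"
    out, replaced = [], False
    for line in hosts_text.splitlines():
        if pattern.match(line):
            if not replaced:
                out.append(wanted)  # rewrite the first match in place
                replaced = True
            continue  # drop any further duplicates
        out.append(line)
    if not replaced:
        out.append(wanted)
    return "\n".join(out) + "\n"
```

Applying pin_hosts_line twice with the same arguments yields the same text, which is the idempotence a configuration-management task needs.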
0286c78f36 docs(plan): mesh-hardening SPOF — accept + DNS-resilience implementation plan
Two tasks: a base mesh coordinator-FQDN /etc/hosts pin (Molecule TDD) + the accept-and-document docs (R8, ADR-016 availability amendment, STATUS/ROADMAP). Coordinator backup deferred to ADR-022.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 10:49:26 +02:00
3ba22d199a docs(spec): mesh-hardening SPOF — accept single-coordinator SPOF + DNS-resilience pin
Sub-project 3 of the mesh-hardening follow-on. Accepts the single off-site coordinator as a documented availability SPOF (R8 + ADR-016 amendment) given the narrow blast radius (LAN/intra-cluster/local traffic unaffected; only remote relayed mesh access breaks). Hardens the one real gap: a base mesh coordinator-FQDN /etc/hosts pin so managed hosts survive a local-DNS hiccup. Coordinator off-site backup explicitly deferred to an ADR-022 kickoff (no throwaway infra).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 10:42:19 +02:00
f10fe8bb60 docs(status): mesh-hardening askari redesign applied + live reboot-validated (2026-06-20)
Live cutover complete: base INPUT-only default-deny + wt0-primary SSH + permanent WAN break-glass on askari, netbird_coordinator geo-disabled. A real reboot recovered unattended — firewall persisted, Docker forwarding + public services up, coordinator geo-disabled (no FATAL), mesh + both SSH paths back. ROADMAP sub-project 3 (askari redesign) marked DONE; next = relay-SPOF reduction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 09:22:20 +02:00
dfc64da2eb feat(makefile): add EXTRA passthrough to check/deploy for ad-hoc ansible args
Lets an operator pass extra ansible-playbook args through make without bypassing it — e.g. -e ansible_host=<WAN> to manage a host over a relay-independent path during a cutover that restarts its own mesh relay.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 09:22:20 +02:00
0194865437 Merge feat/mesh-hardening-askari-redesign: askari INPUT-only redesign + reboot gate
Mesh-hardening redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Three tasks:

1. netbird_coordinator: disable geolocation (NB_DISABLE_GEOLOCATION) so a no-egress startup can't FATAL the control plane.

2. inventory: askari INPUT-only nftables default-deny (forward stays accept, Docker-safe) + ubongo's static WAN IP as a permanent SSH break-glass + manage over wt0; no sshd ListenAddress change (no boot-race); WAN :22 deliberately left open.

3. ADR-025 harness: askari_inputonly profile proves reboot-safety on a KVM VM (GREEN). Includes leaseshelper-independent VM-IP discovery (arp fallback) and an Ansible-managed virbr-boma nftables drop-in. A suid-root workaround the first implementer installed was backed out; nothing privileged reintroduced.

Whole-branch review (opus): ready to merge. Task 4 (live cutover) is operator-gated, not in this branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:47:03 +02:00
d6e80990b2 fix(integration): real wait_for_ip arp-fallback test + document substrate coverage gap
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:41:11 +02:00
d1941c987e feat(integration_test): Ansible-manage virbr-boma nftables input allow
Adds a nftables drop-in (10-libvirt-boma.nft) to base's drop-in dir that
allows traffic on iifname "virbr-boma" in the inet filter input chain.
Fixes DHCP/DNS being dropped by base's default-deny INPUT policy for VMs
on the libvirt integration bridge. Mirrors docker_host's drop-in pattern.

Molecule scenario updated to exercise only the firewall tasks (package
install unavailable in the no-internet Docker container) via include_role
tasks_from; verify asserts the drop-in renders the virbr-boma accept rule.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:29:45 +02:00
dc5cc8933f fix(harness): fall back to --source arp for VM IP discovery (no leaseshelper)
wait_for_ip now tries --source lease first then --source arp; both produce
identical output handled by parse_lease_ip. Removes the suid leaseshelper
dependency introduced and backed out in Task 3. New unit test confirms
parse_lease_ip works on --source arp output format.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 22:29:35 +02:00
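The lease-then-arp fallback in dc5cc8933f might look roughly like the sketch below. It assumes the driver shells out to `virsh domifaddr <domain> --source <lease|arp>`, which prints the same table layout for both sources. parse_lease_ip is the name used in the commit, but this body is a guess at its behaviour. discover_ip and the injectable run parameter are hypothetical.

```python
import re
import subprocess
from typing import Optional

# Matches the "ipv4   192.168.150.23/24" columns of a domifaddr table row.
_ADDR = re.compile(r"\bipv4\s+(\d{1,3}(?:\.\d{1,3}){3})/\d+")


def parse_lease_ip(output: str) -> Optional[str]:
    """Return the first IPv4 address found in domifaddr output, if any."""
    match = _ADDR.search(output)
    return match.group(1) if match else None


def discover_ip(domain: str, run=subprocess.run) -> Optional[str]:
    """Ask the DHCP lease source first, then fall back to the ARP table."""
    for source in ("lease", "arp"):
        result = run(
            ["virsh", "domifaddr", domain, "--source", source],
            capture_output=True,
            text=True,
            check=False,
        )
        ip = parse_lease_ip(result.stdout)
        if ip:
            return ip
    return None
```

Passing run as a parameter keeps the fallback order unit-testable without a hypervisor.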
4933186d31 docs(friction): task-3 integration-gate findings (dnsmasq, nftables, hostname)
Documents three blockers found while developing the askari_inputonly
integration-test profile:

1. inet filter default-deny silently blocks libvirt dnsmasq DHCP: nftables
   multi-table independence means ip filter LIBVIRT_INP accept does NOT
   prevent inet filter drop. Diagnosed via strace; fixed with a drop-in.

2. libvirt leaseshelper PID-file: virPidFileReleasePath unlinks the file after
   every call, and the nobody user cannot recreate it in /run/. Fix: suid root C wrapper.

3. cloud-init rejects underscores in local-hostname → skips network-config
   → no DHCP. Fix: sanitize with replace("_", "-") in meta-data hostname.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 19:16:45 +02:00
9f0626040b docs(todo): add note on ubongo↔cluster network topology question
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 19:15:18 +02:00
8ca42c389c fix(integration): fix VM boot: hostname, netplan, known_hosts handling
Three fixes found during askari_inputonly integration-test development:

1. Hostname sanitization: cloud-init rejects underscores in local-hostname
   (silently skips network-config → VM never gets DHCP). Sanitize with
   name.replace("_", "-") for the meta-data hostname; paths/domain names
   keep the original (underscore is valid there).

2. Netplan explicit interface: match.name: en* with a named key produces a
   .network file that networkd never DHCPs. Use explicit enp1s0 (the name of
   every virtio NIC in these KVM VMs) + renderer: networkd to bypass the bug.

3. ansible_ssh_common_args in the generated hosts.yml: integration VMs
   reuse IPs (different VMs at the same 192.168.150.x lease), and the
   StrictHostKeyChecking=accept-new setting from ansible.cfg rejects changed
   keys. Add StrictHostKeyChecking=no + UserKnownHostsFile=/dev/null per-host
   to the generated inventory so stale known_hosts entries never block the
   apply step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 19:15:07 +02:00
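Fix 1 of 8ca42c389c (hostname sanitization) reduces to a one-line transform. The sketch below is illustrative only: the function names are hypothetical, and keeping the original name as the instance-id is an assumption that follows the commit's note that other identifiers keep the underscore. instance-id and local-hostname are the standard cloud-init NoCloud meta-data keys.

```python
def cloud_init_hostname(vm_name: str) -> str:
    """Replace underscores, which cloud-init rejects in local-hostname."""
    return vm_name.replace("_", "-")


def render_meta_data(vm_name: str) -> str:
    """Render a NoCloud meta-data document for a test VM."""
    # Only the hostname is sanitized; the instance-id keeps the profile
    # name unchanged (an assumption for this sketch).
    return (
        f"instance-id: {vm_name}\n"
        f"local-hostname: {cloud_init_hostname(vm_name)}\n"
    )
```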
1042f161b6 test(integration): askari_inputonly — INPUT-only default-deny reboot gate
Adds the ADR-025 integration-test profile that proves the askari
mesh-hardening REDESIGN (INPUT-only default-deny, forward ACCEPT for Docker)
is reboot-safe on a throwaway KVM VM before the live cut-over.

Profile applies base (firewall + sshd) and offsite (docker_host +
reverse_proxy). Post-reboot verify checks: input policy drop, forward
policy accept, admin-addr break-glass SSH (192.168.150.1), Docker up,
and a published port answered from the controller. GREEN on 2026-06-19.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-19 19:14:55 +02:00
d9b8676fce feat(inventory): askari INPUT-only firewall + WAN break-glass + manage over wt0
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 17:18:58 +02:00
ab328a2f79 feat(netbird_coordinator): disable geolocation so no-egress startup can't FATAL the control plane
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 17:15:33 +02:00
61cbcc6c18 docs(friction): re-asked settled defaults (push + subagent-driven) at plan->execute handoff
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 17:11:01 +02:00
6be758bece docs(plan): mesh-hardening redesign — askari implementation plan
Four tasks: netbird_coordinator geolocation disable (TDD via Molecule) -> inventory enablement (INPUT-only firewall + WAN break-glass + manage over wt0) -> an askari_inputonly integration profile (the reboot-safety GREEN gate) -> the operator-gated supervised live cutover + STATUS/ROADMAP update. Tasks 1-3 are autonomously implementable; Task 4 is operator-gated (live off-site host, lockout risk).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:32:27 +02:00
a178729587 docs(spec): mesh-hardening redesign — askari wt0-primary + WAN break-glass
Redesign of the backed-out 2026-06-17 askari SSH->wt0 attempt. Mirrors the proven ubongo 2/3 pattern (INPUT-only default-deny, SSH scoped by iifname wt0, no sshd ListenAddress change -> no boot-race) and adds the coordinator-host exception the incident demanded: a permanent non-mesh break-glass (WAN :22 from ubongo's static WAN IP + the Hetzner console), WAN :22 deliberately left open. Folds in the netbird_coordinator geo-DB robustness fix (FRICTION #4) so a transient egress blip can't FATAL the control plane. Harness-GREEN gate before a supervised live cutover.

Operator decision (2026-06-19): do this redesign first, then a separate sub-project to reduce askari's SPOF role.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:25:26 +02:00
ef5e049e9b docs(status): mesh-hardening 2/3 — ubongo reboot-validated
After an operator reboot of ubongo, verified live that the INPUT-only default-deny ruleset re-applied on boot (input chain policy drop + the full wt0/ssh-from-control/admin-addr allow-list), the wt0 mesh came back (Management+Signal Connected), and both SSH paths recovered clean. Closes the 'real-host reboot validation pending' item for mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 16:25:19 +02:00
215060bac1 Merge feat/mesh-hardening-ubongo: ubongo INPUT-only default-deny (mesh-hardening 2/3)
Sub-project 2 of the mesh-hardening follow-on. base gains base__firewall_input_only
(forward-policy knob) + base__firewall_admin_addrs; enabled on ubongo (INPUT-only
default-deny). 'be ubongo' integration profile + profile-aware verify, plus two
harness fixes found by running it (virt-install venv-PATH hijack; nft priority
format). Applied + live-verified on the real ubongo; real-host reboot validation
pending (low-risk). FRICTION: VM-testing standard, libvirt stale-session,
Docker-nat-flush, Molecule coverage gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 15:34:31 +02:00
fa2c4c6368 docs(status): mesh-hardening 2/3 — ubongo INPUT-only default-deny applied
base firewall applied + live-verified on ubongo (INPUT-only default-deny;
base__firewall_input_only). Records the Docker-nat-flush caveat (a Docker host
needs 'systemctl restart docker' afterwards), the claude self-SSH grant, and
reboot-validation-pending.
ROADMAP: sub-project 2 done; remaining = NetBird ACL + askari redesign.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 15:34:20 +02:00
a881185c73 docs(friction): base firewall flush wipes Docker nat (cutover finding)
Applying base's nftables (even INPUT-only/forward-accept) to a Docker host
flushes Docker's ip nat -> container egress breaks until 'systemctl restart
docker'. Found on the ubongo mesh-hardening 2/3 live cutover; the Docker-less
test VM couldn't surface it. Self-heals on reboot (dockerd re-adds nat;
forward=accept doesn't block). Runbook/docker_host follow-ups noted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 15:16:21 +02:00
180af46879 docs(friction): log the Molecule input_only-accept coverage gap
Final-review finding: the default Molecule scenario only renders the forward
drop (input_only off) branch; the accept branch is covered by the integration
harness only. Tracked for a kaizen decision (2nd scenario vs accept the split).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:40:29 +02:00
8d8c86fa39 docs(friction): VM-testing standard + libvirt stale-session gotcha
Two signals from running the ubongo harness gate: (1) the operator wants a
standard pre-authorising isolated VM integration tests on ubongo so the agent
doesn't ask each time; (2) a stale agent session (shell predating the
integration_test libvirt-group grant) carries stale process groups, so the
harness's qemu-img/file writes are denied -> run via 'sg libvirt -c ...';
self-heal idea noted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:32:09 +02:00
468f8c3a92 fix(integration): match live nft priority filter in the ubongo verify
`nft list ruleset` prints the symbolic chain priority (`filter` = 0); the ubongo
profile asserted `priority 0` (the rendered-file format the Molecule scenario
checks), so the live-ruleset assertion failed even though the firewall was
correct. Assert `priority filter` for the input/forward policy lines. Caught by
the harness GREEN gate (`make test-integration HOST=ubongo`).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:32:09 +02:00
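The mismatch fixed in 468f8c3a92 is between two spellings of the same chain priority. The commit resolved it by asserting the `priority filter` string. The sketch below shows a different approach, normalizing both spellings before comparing. All names are hypothetical, and the regex handles only a bare symbolic or numeric priority, not offset forms such as `filter + 10`.

```python
import re

# One base-chain declaration, as rendered in a ruleset file or as printed
# by `nft list ruleset`.
_CHAIN = re.compile(r"type filter hook (\w+) priority (\S+); policy (\w+);")

# Symbolic priority names mapped to their numeric value; only the one
# mentioned in the commit is listed.
_SYMBOLIC = {"filter": 0}


def chain_policies(ruleset: str) -> dict:
    """Map each hook to (numeric priority, policy) for filter base chains."""
    result = {}
    for hook, prio, policy in _CHAIN.findall(ruleset):
        number = _SYMBOLIC[prio] if prio in _SYMBOLIC else int(prio)
        result[hook] = (number, policy)
    return result
```

With this, a verify step could compare `chain_policies(live)["input"]` against `(0, "drop")` regardless of which form nft printed.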
26bb7e442d fix(integration): pin system python for virt-install (venv PATH hijack)
The Makefile prepends .venv/bin to PATH (so the venv's ansible tools resolve),
but virt-install's `#!/usr/bin/env python3` shebang then resolved to the
isolated venv, which lacks system PyGObject (gi) -> ModuleNotFoundError. Strip
.venv/bin from PATH for the virt-install call so its shebang finds
/usr/bin/python3 (which has gi); ansible runs via its absolute .venv path and is
unaffected. Surfaced running `make test-integration HOST=ubongo`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:32:09 +02:00
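The PATH fix in 26bb7e442d amounts to filtering one entry out of the environment for a single child process. This is a minimal sketch under that reading. The function names and the example paths are hypothetical, and the driver's actual implementation may differ.

```python
import os
from typing import Mapping, Optional


def path_without_venv(path: str, venv_bin: str) -> str:
    """Return PATH with every entry equal to venv_bin removed."""
    venv_bin = os.path.normpath(venv_bin)
    kept = [
        entry
        for entry in path.split(os.pathsep)
        if entry and os.path.normpath(entry) != venv_bin
    ]
    return os.pathsep.join(kept)


def system_tool_env(venv_bin: str, environ: Optional[Mapping[str, str]] = None) -> dict:
    """Copy the environment, stripping the venv so shebangs resolve system python3."""
    env = dict(os.environ if environ is None else environ)
    env["PATH"] = path_without_venv(env.get("PATH", ""), venv_bin)
    return env
```

The returned dict would be passed as `env=` to the subprocess call that launches virt-install only, so tools invoked by absolute venv path are unaffected.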
6ac5afaf67 test(integration): add the 'be ubongo' profile (input-only default-deny)
A control-group VM that applies base with INPUT-only default-deny (forward
policy accept; admin-addr SSH allow). verify.yml is now profile-aware via an
integration_profile marker — the askari Docker/DNAT block is gated, and a ubongo
block asserts input drop + forward accept + the admin-addr rule. Enables
`make test-integration HOST=ubongo`. Mesh-hardening 2/3 (ADR-025).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:52:17 +02:00
b3e14decb4 feat(inventory): ubongo gets INPUT-only host firewall + mamba LAN SSH
Enables base__firewall_input_only on the control group (forward chain stays
permissive so Docker egress + the integration-test libvirt NAT survive) and
allows the operator workstations' LAN IPs (mamba 10.20.10.50 + 10.20.10.17;
raw leases, backstopped by wt0). Mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:42:49 +02:00
b10a33f439 feat(base): input-only forward policy + admin-addr SSH allow
base__firewall_input_only renders the forward chain policy accept (host-local
INPUT filtering only) for hosts that forward container/NAT traffic; defaults
false so real service hosts keep the forward default-deny. base__firewall_admin_addrs
adds operator-workstation LAN sources to the SSH allow-list alongside wt0 +
ssh-from-control. Molecule locks the secure default + the admin rule.
Mesh-hardening 2/3 (ADR-020/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:37:06 +02:00
66a9a0af08 docs: ubongo admin-addrs add 10.20.10.17 + flag raw-lease follow-up
Allow a second operator workstation (10.20.10.17) onto ubongo's LAN SSH
alongside mamba (10.20.10.50). Both are raw DHCP leases; recorded a FRICTION
open signal to replace them with MAC-pinned OPNsense reservations when
OPNsense-as-code lands (ADR-020 / TODO 3.5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:26:04 +02:00
e14e347047 docs(plan): mesh-hardening 2/3 — ubongo implementation plan
Five tasks: base knobs (input-only forward policy + admin-addr SSH allow,
TDD via Molecule) → enable on the control group → a 'be ubongo' integration
profile (profile-aware verify) → the real-VM harness GREEN gate → the
operator-supervised live cutover (signal-6 order, physical-console break-glass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:26:04 +02:00
24a1d909c9 docs(spec): mesh-hardening 2/3 — ubongo INPUT-only default-deny
Sub-project 2 of the mesh-hardening follow-on (the post-incident roadmap
ordering puts ubongo first). Harden the control node's inbound surface via
base's nftables firewall as INPUT-only default-deny: the forward chain stays
permissive (new base__firewall_input_only knob) so Docker egress + the
libvirt-NAT integration harness keep working, and there is no sshd ListenAddress
change — sidestepping the ip_nonlocal_bind boot-race that sank askari. SSH
allowed from wt0, ssh-from-control (Ansible self), and mamba on the LAN (new
base__firewall_admin_addrs). Harness-validated before an operator-supervised
cutover; the physical console is the permanent break-glass.

Design maps to the four relevant 2026-06-17 incident lessons (FRICTION signals
1/2/3/6).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 09:12:58 +02:00
77a20b8d40 docs(runbook): netbird-client mesh-drop / DNS troubleshooting
Document the 2026-06-18 incident class: a road-warrior laptop losing DNS on a network transition strands NetBird (can't resolve the coordinator FQDN), taking ubongo unreachable until DNS recovers. Adds triage (local DNS vs coordinator), device mitigations (reliable resolvers + hosts-file pin), the non-mesh LAN break-glass to ubongo, and why ubongo is relay-only (deferred mesh-hardening, not a bug) — including the break-glass rule that hardening must preserve.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 22:30:41 +02:00
a23ecd708d Merge feat/integration-testing: local VM integration testing (ADR-025, TODO 2.4)
A stdlib driver (scripts/integration-vm.py) boots throwaway KVM VMs on ubongo mirroring a real host, applies the real playbooks, performs a real reboot, and asserts outcomes - catching the reboot/firewall/Docker class Molecule cannot. Validated end-to-end on real hardware: RED->GREEN acceptance passed (reproduced the 2026-06-17 incident, then proved the docker_host container-forward drop-in survives reboot). Also: claude AI-worker granted NOPASSWD sudo (reverses ADR-015 no-local-sudo; ADR-015/021 + accepted-risk R7, codified in base); 9 shakedown findings in FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:52:59 +02:00
bc8592616b fix: address final whole-branch review findings
- ADR-023 §4: ADR-015 no-sudo sub-decision now Superseded-by ADR-025 (bidirectional), not just an in-place amendment.
- STATUS: drop the deferred `reset` verb; honest integration_test (molecule not run in this env; applied to ubongo) + verify (forward/DNAT, not wt0); RED->GREEN validated.
- driver: remove unused `import shutil`.
- README: fix the ADR-025 link filename.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:52:28 +02:00
d7bd31babb docs(adr/status): integration-testing harness RED→GREEN validated (ADR-025)
The local-VM integration harness RED→GREEN acceptance passed on real hardware
(2026-06-18): a KVM VM on ubongo reproduced the 2026-06-17 nftables/Docker reboot
breakage (RED) and survived with the docker_host container-forward drop-in (GREEN).

ADR-025: Status updated to PASSED; shakedown learnings section added (UEFI boot
required, claude sudo load-bearing); ADR-021 added to Related.
STATUS.md: integration-harness section updated from PENDING to PASSED; ubongo
entry updated to reflect claude NOPASSWD sudo + sjat-ansible NOPASSWD removal;
last-reviewed date updated.
docs/TODO.md: item 2.4 collapsed to one-line pointer per the file's convention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:39:30 +02:00
cc772ff845 docs(adr/security): record claude NOPASSWD sudo model (ADR-015 amend + R7)
The integration-testing shakedown reversed ADR-015's "no local sudo" sub-decision:
the claude AI-worker now has NOPASSWD:ALL sudo on ubongo — without it, virsh,
nft, and journalctl all block during VM diagnosis. Compensating controls:
password-locked account, auditd/Loki attribution, repo-managed revocable drop-in.

ADR-015: dated amendment note in Status + expanded AI-worker identity section.
ADR-021: new §Sudo model (amendment 2026-06-18) — claude=NOPASSWD, sjat=password
required; former sjat NOPASSWD drop-in removed 2026-06-18 (least-privilege cleanup).
accepted-risks.md: R7 added (claude NOPASSWD:ALL on ubongo); last-reviewed updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:39:20 +02:00
3fe6f68316 feat(base): codify AI-worker NOPASSWD sudo (ADR-015 amended)
Add base__ai_worker_user var (default empty), a new operational_access.yml
task file that drops a validated sudoers file for the named user, and wire it
into base/tasks/main.yml after the hardening includes under the `users` tag.

Set base__ai_worker_user: claude in group_vars/control so that applying base
to ubongo is idempotent with the manual /etc/sudoers.d/claude-ai-worker drop-in
already in place. Password remains locked; NOPASSWD is the only sudo path;
actions are attributed via auditd (ADR-021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:36:31 +02:00
b1aa0f49d9 fix(integration): verify probes :80 without following redirects
Accept caddy's 308 on :80 as proof the DNAT+forward path is alive; don't follow into https (tls internal has no cert for a bare-IP request). This load-bearing end-to-end check is what caught the br-+/br-* nftables-wildcard bug that the string-presence assert missed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:57:47 +02:00
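The probe in b1aa0f49d9 treats the redirect itself as the success signal. The real check lives in the Ansible verify playbook. The sketch below shows the same idea in Python using http.client, which never follows redirects. The function names and the default accepted set are assumptions for illustration.

```python
import http.client


def probe_status(host: str, port: int = 80, path: str = "/", timeout: float = 5.0) -> int:
    """Issue one GET and return the raw status code without following redirects."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()


def forward_path_alive(status: int, accepted=(308,)) -> bool:
    """A 308 from the proxy proves the request crossed DNAT and the forward chain."""
    return status in accepted
```

Any answer at all from the proxy shows the packet path works. Restricting the accepted set to 308 additionally confirms that the expected proxy answered.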
172ae37953 feat(docker_host): container-forward nftables drop-in (reboot-safe Docker forwarding)
base's inet-filter forward chain is policy-drop; on a Docker host that kills published-port DNAT + inter-container forwarding ON REBOOT (nftables loads default-deny before dockerd). This drop-in (loaded via base's /etc/nftables.d/*.nft include at boot) appends the container-bridge accepts so a rebooted Docker host keeps forwarding. Resolves FRICTION 2026-06-17 #1 and the GREEN half of ADR-025's acceptance test. NB nftables wildcard is br-*, not the iptables br-+.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:57:47 +02:00
051c040343 fix(integration): exclude transient .run/ from linters; --- in generated inventory
Running the harness leaves tests/integration/.run/ (gitignored, generated); exclude it from yamllint + ansible-lint so a post-run 'make lint' passes. Also emit a --- doc-start in the generated inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:44:12 +02:00
c7194ca147 feat(integration): allow SSH from the NAT gateway in the askari overlay
base's default-deny firewall would drop the driver's post-reboot SSH from the libvirt NAT gateway; set base__firewall_control_addr to the gateway (by source IP, interface-independent).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:15 +02:00
35446538df fix(integration-vm): apt-ready VMs + sudo-read serial console diagnostics
cloud-init package_update:true + block on 'cloud-init status --wait' in up() so apply sees populated apt lists (fresh genericcloud images ship empty lists); dump_diagnostics()/console() read the root:0600 serial log via sudo instead of shutil.copy, which raised PermissionError mid-diagnostics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:15 +02:00
83983d739c fix(reverse_proxy): plain {% %} tags so the Caddyfile renders under ansible trim_blocks
The tls-internal/acme_ca knobs used {%- -%} trims validated only against raw jinja2; ansible (trim_blocks=True) double-stripped newlines and collapsed the Caddyfile onto single lines, crash-looping caddy. Match the role's existing plain {% %} style.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:35:15 +02:00
941141e270 docs(friction): capture 9 signals from the ADR-025 harness shakedown
UEFI-vs-BIOS boot loop, no-sudo diagnosis gap (-> claude sudo decision), qemu
session-vs-system URI, system-qemu home-traversal, directory-inventory phantom
hosts, jinja trim_blocks render trap, empty apt lists on fresh cloud images,
NAT-gateway firewall allow, and the review-vs-hardware coverage lesson.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:30:13 +02:00
f27514860e fix(integration-vm): boot test VMs via UEFI
The Debian 13 genericcloud image triple-faults at the legacy real-mode kernel
handoff under SeaBIOS/q35 (boot-loops at GRUB, no 'Decompressing Linux', no DHCP
lease). Booting via UEFI (OVMF -> efistub) bypasses the legacy entry and boots
cleanly: cloud-init runs, DHCP lease obtained, SSH reachable. Verified end-to-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:13:35 +02:00
65bacb25fa feat(integration-vm): force DHCP via explicit cloud-init network-config
Don't rely on the genericcloud image's network fallback; the seed now carries a
network-config forcing dhcp4 on en* interfaces. A correct prerequisite for the VM
to network once cloud-init processes the seed. (Note: a separate no-DHCP-lease
issue on first real boot is still under investigation — the guest isn't networking
and, under the no-sudo claude model, the VM console/logs aren't introspectable
without libguestfs; see next steps.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 15:05:49 +02:00
e5256696d6 fix(integration-vm): place VM disk/seed/console in CACHE_DIR for system-qemu
Under qemu:///system the hypervisor runs as libvirt-qemu, which cannot traverse
/home/claude — so the overlay/seed/console must live in /var/lib/boma-integration
(group libvirt, world-traversable, created by the integration_test role), not the
repo/home RUN_DIR. The inventory (hosts.yml + group_vars symlink, read by ansible
as claude) stays in RUN_DIR. Verified: virt-install now creates the domain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 14:56:35 +02:00
147eb874ea fix(integration-vm): pin LIBVIRT_DEFAULT_URI=qemu:///system
Bare virsh/virt-install default to qemu:///session for a non-root caller, but
the substrate, /dev/kvm, and the boma-it NAT network live on the SYSTEM libvirtd.
Pin the URI so the driver targets system regardless of who runs it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 14:41:31 +02:00
ed1187d1c3 fix(integration-vm): point ansible -i at hosts.yml, not the run dir
The driver passed -i <RUN_DIR>/ (a directory); ansible's directory-inventory
loader then parsed sibling files (notably 'current', which holds the real host
string 'askari') as INI inventory, creating phantom hosts incl. the real askari
with its full hostvars — violating the single-host safety invariant (and a hard
error in ansible 2.18 on the binary qcow2/seed files). Point -i at the single
hosts.yml file; ansible still loads the adjacent group_vars symlink. (review C1)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 13:04:54 +02:00
f51ae1a13d docs(runbook): integration-testing runbook + pre-flight cross-links
- New docs/runbooks/integration-testing.md: when to use (firewall/
  sshd/boot/Docker changes); make test-integration commands; lower-
  level driver sub-commands; cert tier guidance; diagnostics dir;
  VM inspection (virsh console / SSH); safety invariants; resource
  constraints; adding a new profile; self-validating acceptance test.
- docs/runbooks/new-host.md: pre-flight warning before deploying
  lockout-risky changes (firewall/sshd/boot) while break-glass is open
- docs/runbooks/new-role.md: step 13 pre-flight for lockout-risky roles

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 12:59:06 +02:00
4732730515 docs: wire ADR-025 into testing/control-host/risks/status/capacity
- ADR-008: add reboot-survivability gap row + ADR-025 pointer to the
  "not tested in Molecule" table
- ADR-015: reconcile "not a hypervisor" with ephemeral KVM test VMs
  (ADR-025); note ~3 GiB test-VM RAM against the 16 GiB sizing
- accepted-risks: add R6 (le-prod-wildcard PAT + transient TXT records)
- CLAUDE.md: add make test-integration[/-clean] to key-commands;
  add ADR-025 + runbook rows to further-reading
- hardware/reference.md: note one ephemeral KVM test VM on ubongo
- STATUS.md: add integration harness entry (built, lint+pytest clean;
  RED/GREEN acceptance PENDING ubongo live pass); TODO 2.4 stays open

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 12:51:22 +02:00
edcc347a95 docs(adr): ADR-025 local VM integration testing
Accepted decision to implement ADR-008 Level 2/3 on ubongo via
libvirt/KVM directly: throwaway VM overlays, stdlib-only driver,
tiered cert fidelity, three safety invariants. Addresses the
2026-06-17 mesh-hardening incident's reboot-survivability gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 12:49:52 +02:00
d68734267b feat(make): test-integration / test-integration-clean targets
Add ADR-025 integration-test harness targets to Makefile:
- test-integration HOST=<name> [CERTS=internal|le-staging] [KEEP=1]
- test-integration-clean (prune stale VM snapshots)

Also add tests/integration/.run/ to .gitignore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:45:38 +02:00
3769c9ebb9 feat(integration): outcome-based verify playbook (DNAT-survives-reboot)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:38:22 +02:00
10121e72d3 feat(integration): askari profile, stub overlay, cert-tier files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:37:32 +02:00
0989f047eb feat(reverse_proxy): tls-internal + acme_ca knobs for integration/staging (ADR-025)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:30:49 +02:00
4fb4cf99c3 fix(integration-vm): boot-id-verified reboot + actionable timeouts + inventory guard (review)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-18 12:28:06 +02:00
68abd67ce6 feat(integration-vm): teardown, prune, console, full cycle + dispatch
2026-06-18 12:21:06 +02:00
8ea9966d88 feat(integration-vm): reboot, verify run, failure diagnostics
2026-06-18 12:20:52 +02:00
d1c91930ac feat(integration-vm): transient inventory + real-playbook apply
2026-06-18 12:20:37 +02:00
fdd4df34b1 feat(integration-vm): network + VM boot (overlay, cloud-init seed, virt-install import)
2026-06-18 12:20:25 +02:00
af76763c16 feat(integration-vm): golden image fetch + SHA512 verification
2026-06-18 12:19:58 +02:00
a8dc3c787a feat(integration-vm): cert-tier + profile + transient inventory rendering
2026-06-18 12:17:37 +02:00
6f53d00b71 feat(integration-vm): cloud-init user-data/meta-data rendering
2026-06-18 12:12:08 +02:00
b5d5dffeaf feat(integration-vm): vm naming, RAM guard, lease IP parsing
2026-06-18 12:11:56 +02:00
64767ac187 feat(integration-vm): driver skeleton + CLI dispatch
2026-06-18 12:11:41 +02:00
ac6a01296a feat(integration_test): KVM/libvirt substrate role on the control node
2026-06-18 12:09:35 +02:00
65533be4d9 docs(plan): implementation plan for local VM integration testing (2.4)
20-task TDD plan: integration_test substrate role, stdlib virsh driver, askari profile, tiered certs, RED->GREEN acceptance, docker_host container-forward fix, ADR-025 + docs. Follows the 2026-06-18 design spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 11:56:04 +02:00
02e1eb7449 docs(spec): design local VM integration testing on ubongo (2.4)
Throwaway KVM VMs on ubongo (libvirt, Approach A) that mirror a real host (real Docker, real reboot, real role apply) to catch the reboot/firewall/boot-order class of bugs that Molecule cannot, as in the 2026-06-17 mesh-hardening incident. First profile: askari; tiered certs (internal + le-staging built, le-prod-wildcard on-demand). Concrete build of ADR-008 Level 2/3; to be recorded as ADR-025.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 11:35:51 +02:00
69faaf5e43 docs(todo): local VM integration testing (2.4) + screenshot hand-off (10.8)
From the 2026-06-17 mesh-hardening incident: Molecule can't catch
reboot/firewall-x-Docker/boot-order bugs — build local-VM pre-deploy testing
on ubongo (ADR-008 Level 2/3). And a smooth screenshot hand-off for the agent
during incidents.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 22:27:26 +02:00
958e35e3c3 docs(friction): capture 6 signals from the mesh-hardening 1/3 incident
firewall-breaks-Docker-hosts, ip_nonlocal_bind didn't beat the boot race,
coordinator-host circular bootstrap, NetBird geo-DB FATAL dependency, no
off-site coordinator backup, and reboot-tested-after-removing-break-glass.
For the next /kaizen + the mesh-hardening re-spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 22:21:19 +02:00
847d9885e2 revert: back out mesh-hardening 1/3 on askari after it broke the Docker host
Incident 2026-06-17: applying base's nftables default-deny (forward policy drop)
to askari — a Docker host — broke container forwarding/NAT on reboot, and the
wt0-only sshd ListenAddress left no break-glass (ip_nonlocal_bind did NOT beat
the boot race). Recovery: disable nftables + restart docker (restore the wiped
NAT masquerade) + force-recreate the coordinator (it FATAL-looped unable to
download its GeoLite2 DB with no egress) -> mesh re-formed.

Back out the enablement so a future deploy can't re-break askari:
- offsite_hosts: base__ssh_listen_mesh_only=false, base__firewall_apply=false
- remove host_vars/askari.yml (manage over the WAN again, not wt0)
- tf/offsite: re-open WAN :22 to ubongo only (break-glass; already applied)

askari now: sshd on all interfaces (Ansible-managed), nftables disabled, WAN :22
open -> stable + reboot-survivable. The base feature code (sshd ListenAddress
option, firewall public zone) stays; it's just not enabled on Docker hosts.
Mesh-hardening 1/3 to be re-spec'd before any retry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 22:16:17 +02:00
b0511179cb feat(tf/offsite): retire askari's WAN :22 (mesh-only SSH)
The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty
ssh_admin_cidrs (default []); askari sets it empty so the WAN :22 rule is
removed on the next apply. SSH is reached over wt0; break-glass is the Hetzner
console. Apply is the live cutover (Task 5). Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:51:24 +02:00
cc21344ab1 feat(inventory): manage askari over wt0 + enable mesh-only SSH
host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:49:15 +02:00
3b30e70ba5 feat(firewall): public zone + askari's public services in the catalog
Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
(3478/udp) ingress so the base nftables default-deny does not drop the live
public services when applied to askari. Molecule + filter unit test cover the
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:46:03 +02:00
39d2ad38ca feat(base): opt-in sshd ListenAddress on the mesh IP (fail-closed)
base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with
ip_nonlocal_bind to beat the post-boot bind race and a fail-closed assert so an
unresolved address never silently listens on all interfaces. Molecule covers
the render + sysctl. Mesh-hardening 1/3 (ADR-016/021).

Environmental checkpoint applied: the molecule-debian13 container image lacks
procps (no sysctl binary). Added molecule/default/prepare.yml to install procps
and sysctls: {net.ipv4.ip_nonlocal_bind: "0"} to molecule.yml platform so the
ansible.posix.sysctl task can write and read back the value hermetically.
Sysctl file format is net.ipv4.ip_nonlocal_bind=1 (no spaces); verify.yml
grep pattern updated to match ansible.posix.sysctl's actual output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:43:08 +02:00
dfa363cecd docs(plan): mesh-hardening 1/3 — askari SSH onto wt0 implementation plan
5 tasks: base sshd ListenAddress+ip_nonlocal_bind (Molecule), firewall public
zone + askari catalog, inventory wt0 override, TF retire WAN :22, then the live
operator-supervised staged cutover.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:25:59 +02:00
292c204752 docs(spec): mesh-hardening 1/3 — move askari SSH onto wt0
Decomposes the M5 mesh-hardening follow-on into 3 independent sub-specs; this
is sub-project 1. Three-layer SSH-on-wt0 (sshd ListenAddress=mesh + nftables
iifname wt0 + retire the Hetzner WAN :22), ip_nonlocal_bind to beat the
post-boot wt0 bind race (fail-closed), live wt0 fact for the listen addr,
staged cutover with the firewall auto-rollback as the safety gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 20:15:12 +02:00
e5a8e5d3b9 docs(roadmap): Phase 1 complete — point Next step at mesh-hardening follow-on
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 18:39:08 +02:00
5947ba8756 chore(vault): Forgejo registry_token supplied (operator-minted, encrypted)
registry-login verified end-to-end (docker login -> Login Succeeded).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 18:37:11 +02:00
a0762c563e docs(kaizen): bind-mount gotcha + consume 7 signals into the ledger (2026-06-17)
Migrate the single-file-bind-mount/stale-config gotcha (reload-in-place needs a
directory mount; restart-based roles don't) to docs/testing/gotchas.md, and move
all 7 open signals out of FRICTION.md's Open-signals section into the new
2026-06-17 decisions-ledger block: all consumed, 1 PARK (the ubongo
self-management gap, tracked in STATUS), 0 REMOVE. Relax test_load_signals to
accept an empty Open-signals section (the goal state after a kaizen pass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:50:17 +02:00
c1323a3f29 feat(make): registry-login via vaulted Forgejo token (kaizen)
scripts/registry-login.sh reads vault.forgejo.registry_token and pipes it to
docker login --password-stdin (never echoed, never on argv); 'make registry-login'
wires it with the venv binaries. Adds the operator-minted CHANGEME vault stub
(fill via make edit-vault) and a per-machine prereq note in the claude-code-setup
runbook, so 'make caddy-image-push'/'molecule-image-push' become agent-completable
non-interactively. Consumes the 2026-06-15 signal in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:50:07 +02:00
39904a778a fix(hooks): scope vault-preflight to staged ansible; catch prose exec re-asks
guard-vault-preflight: block a locked 'git commit' only when the staged set
(git diff --cached, plus -a/--all) contains ansible content matching the
pre-commit ansible-lint hook's files: scope. Docs-/config-only commits never
trigger that hook, so they no longer need the vault — fixing the false block on
docs-only commits. Fails safe to block when unsure.
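
The scoping rule above can be sketched roughly as follows. This is a minimal illustration, not the hook's actual code: the function name `commit_needs_vault` and the regex `ANSIBLE_FILES_RE` are hypothetical stand-ins for the ansible-lint hook's real files: scope, and the staged list is assumed to come from something like `git diff --cached --name-only`.

```python
import re
from typing import Iterable, Optional

# Hypothetical stand-in for the pre-commit ansible-lint hook's `files:` scope.
ANSIBLE_FILES_RE = re.compile(r"^(roles|playbooks|inventories)/.*\.ya?ml$")


def commit_needs_vault(staged_paths: Optional[Iterable[str]]) -> bool:
    """Return True when a commit should be blocked while the vault is locked."""
    if staged_paths is None:
        # The staged set could not be determined: fail safe and block.
        return True
    # Block only if at least one staged path would trigger ansible-lint.
    return any(ANSIBLE_FILES_RE.search(path) for path in staged_paths)


if __name__ == "__main__":
    print(commit_needs_vault(["docs/STATUS.md"]))
    print(commit_needs_vault(["roles/base/tasks/main.yml"]))
```

Treating an unknown staged set as "block" keeps the guard conservative: a false block costs an unlock, while a false pass would let the lint hook fail mid-commit.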

guard-execution-mode-menu: widen the execution-mode arm to also catch free-form
prose re-asks of the subagent-vs-inline choice ('which execution approach?',
'subagent vs inline', ...), which the literal-menu matcher missed; the push
re-ask is intentionally left to the dont-reask-settled-defaults memory.

Consumes two 2026-06-17 signals in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:49:55 +02:00
8f1c7d47ec fix(reverse_proxy,netbird_coordinator): create scaffold dirs in check mode
Add check_mode: false to the state:directory base_dir tasks so that 'make check'
on a brand-new compose service role creates the scaffold during --check and the
rest of the dry-run (templates + docker_compose_v2 up) can be evaluated instead
of failing on a missing project_src. The directive is inert under a normal
converge (incl. Molecule + its tagged second converge), so role tests are
unchanged. Consumes the 2026-06-16 signal in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:49:47 +02:00
b0c0150db2 feat(scan): repo-scan rename-incomplete check (kaizen)
When a numbered ADR announces a rename Old->New, flag design-doc lines where
Old still appears in present tense — skipping the announcing ADR, lines that
also name New, and historical/negation cues, and rejecting ADR-NNN tokens as
terms. Structural cousin of stale-deferred; run by /review-repo. Zero findings
on the current tree (the Traefik->Caddy ripple edits have landed). Consumes the
2026-06-14 KEEP-OPEN signal in docs/FRICTION.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:49:41 +02:00
959f9b30b5 feat(statusline): show context-window usage % in the status line
Adds .claude/statusline.sh (reads context_window.used_percentage +
context_window_size straight from the statusLine JSON; green<70/yellow/red
bar) and wires it via .claude/settings.json statusLine. Committed in-repo so
it follows boma to any clone, matching how .claude/ already tracks hooks +
plugins.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:35:47 +02:00
5d14efc864 docs: Phase 1 complete — clients enrolled + NetBird client runbook
mamba + work laptop enrolled in the mesh → ubongo reachable from anywhere; the
mobile-access goal is met and Phase 1 (remote access) is complete. Adds
docs/runbooks/netbird-client.md (reusable client-enrollment runbook) + STATUS/
ROADMAP flips + CLAUDE.md reading-table entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:11:32 +02:00
8d2a064542 chore(vault): NetBird setup_key supplied (operator-minted, encrypted)
Operator replaced the CHANGEME with a real reusable scoped setup key via
make edit-vault (re-encrypted in place). Encrypted ciphertext only; no plaintext.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:40:58 +02:00
4c8fb9e03b docs: M5 mesh enrollment — ubongo + askari on the mesh
STATUS: base mesh concern built + applied; ubongo (100.99.146.14) + askari
(100.99.226.39) enrolled, link verified; ubongo agent-management access (sjat key
+ NOPASSWD sudo) recorded. ROADMAP M5: infra done, laptops = operator step,
mesh-hardening split out as the deferred follow-on. FRICTION: docs-only-commit rbw
guard + control-node self-management access gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:40:02 +02:00
d202b89480 feat(base): vault setup_key stub + enable mesh on ubongo + askari
vault.netbird.setup_key: CHANGEME (operator mints a reusable scoped key after the
dashboard /setup). base__mesh_enabled: true for control (ubongo) + offsite_hosts
(askari) so the base 'mesh' concern enrols them. Enrollment only — no firewall change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:12:28 +02:00
9b3f8f826f test(base): molecule coverage for the mesh concern (manage-off no-op)
Converge enables mesh with base__mesh_manage:false (+ dummy key) so the include
path runs hermetically; verify asserts netbird is not installed — proving the
concern is a clean no-op when the live actions are gated off. Existing firewall/
ssh/fail2ban assertions unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:11:02 +02:00
44c4978b5f feat(base): NetBird agent enrollment concern (mesh)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:08:23 +02:00
98eb09d8ba feat(base): add the 'mesh' concern tag (NetBird agent, ADR-016)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:01:33 +02:00
4cfc3cddd5 docs(friction): re-asked operator about push + execution mode (settled)
I re-surfaced two already-settled decisions as questions (push to origin; subagent
vs inline) at the M5 handoff. The existing execution-mode guard only matches the
writing-plans menu's literal text, so free-form prose re-asks slip through. Default:
push as backup and go subagent-driven without asking.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:58:26 +02:00
55776fb03c docs(plan): M5 mesh-enrollment implementation plan
8 tasks: build the base 'mesh' concern + tag + vault stub + per-host opt-in
(autonomous), operator handoff for /setup + setup key, gated live enrol of
ubongo + askari, operator laptop enrol, docs. Reachability-only; lockdown deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:49:28 +02:00
4142bb15f8 docs(spec): M5 mesh-enrollment design (reachability-only)
base 'mesh' concern enrols NetBird agents on ubongo + askari via a reusable scoped
setup key (vault); laptops enrolled by the operator. Reachability via the default
peer policy; the base nftables default-deny on ubongo + ACL tightening are deferred
to a follow-on. Resolves ROADMAP M5 design; next: writing-plans.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:44:13 +02:00
94dd6da14c docs(netbird): describe gRPC routing as the deployed Content-Type matcher
README/SECURITY said gRPC was path-matched (/management.ManagementService/* etc.);
the deployed Caddy route selects gRPC by Content-Type: application/grpc* (NetBird's
own external-proxy example). Reconciled the prose to what actually runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 07:54:09 +02:00
684718f4a5 docs(netbird): M4b done — STATUS/ROADMAP/risks/friction
netbird_coordinator built + applied to askari (first service role, dashboard live).
STATUS: new "real and working" row + askari/coordinator rows updated. ROADMAP: M4b
done, M5 (peer enrol) next, recorded the v0.72.4 combined-container/embedded-Dex/
no-Coturn reality. accepted-risks R3: Coturn -> STUN wording. FRICTION: single-file
bind-mount stale-inode gotcha + check-before-first-deploy artifact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 07:48:53 +02:00
3a31b8e6f4 fix(reverse_proxy): bind-mount the Caddy config dir so reload sees changes
The Caddyfile was bind-mounted as a single file. ansible.builtin.template writes
atomically (temp + rename), so a re-render swaps the file's inode while the running
container keeps the old one — `caddy reload` then re-read stale config and silently
no-op'd ("config is unchanged"), so new routes never loaded. Surfaced deploying the
NetBird route: Caddy never requested its cert. Fix: render to ./caddy/Caddyfile and
mount the ./caddy DIRECTORY at /etc/caddy — directory mounts reflect inode swaps, so
graceful `caddy reload` works. Proven on askari: atomic replace in the host dir is
visible inside the running container.
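
The inode swap can be reproduced without Docker. The sketch below is an analogy under stated assumptions, not the role's code: an open file handle stands in for the single-file bind mount (pinned to the original inode), re-opening by path stands in for the directory mount, and `atomic_write` is a hypothetical helper imitating the temp + rename write. It assumes a POSIX filesystem.

```python
import os
import tempfile


def atomic_write(path: str, text: str) -> None:
    """Write to a temp file in the same directory, then rename it over path."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp, path)  # atomic rename: the name now points at a new inode


def demo() -> dict:
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "Caddyfile")
        atomic_write(target, "old routes\n")
        inode_before = os.stat(target).st_ino

        # Stand-in for the single-file mount: pinned to the original inode.
        pinned = open(target)

        atomic_write(target, "new routes\n")
        inode_after = os.stat(target).st_ino

        pinned.seek(0)
        via_pinned = pinned.read()
        pinned.close()

        # Stand-in for the directory mount: the name is resolved on each open.
        with open(target) as fresh:
            via_path = fresh.read()

        return {
            "inode_changed": inode_before != inode_after,
            "via_pinned": via_pinned,
            "via_path": via_path,
        }


if __name__ == "__main__":
    print(demo())
```

The pinned handle keeps returning the old content after the rename, which matches the "config is unchanged" symptom; only a lookup by name through the directory sees the re-rendered file.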

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-16 07:44:45 +02:00
0e8d448f2b feat(offsite): apply netbird_coordinator after reverse_proxy
NetBird joins the boma Docker network that reverse_proxy creates, so it's
ordered last. Carries its netbird_coordinator role-name tag (check-tags).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:05:12 +02:00
070d6f293b docs(netbird): service-role standard files (SECURITY/VERIFY/ACCESS/BACKUP)
Author the four ADR-mandated service-role docs for netbird_coordinator and
add the cross-role access__*/backup__* data (ADR-021/022). First stateful
service: backup__state=true; off-site capture pending the fisi pull node.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 18:01:29 +02:00
1333ec181f feat(reverse_proxy): raw-directive route type; wire NetBird (gRPC/WS) route
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:55:05 +02:00
3762be4622 feat(netbird): vault secrets — auth_secret + datastore_key
Self-generated random values for the NetBird coordinator: auth_secret (relay/JWT
shared secret) and datastore_key (SQLite store encryption, base64 32 bytes with
padding). Wired into roles/netbird_coordinator config.yaml via vault.netbird.*.
No CHANGEME — both are agent-generatable (not operator-supplied). The M5 peer
setup key is a runtime dashboard artifact, added to vault when M5 wires it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:52:16 +02:00
ab1b0678ab feat(netbird): coordinator service role (combined server + dashboard, v0.72.4)
First real service role. NetBird v0.72.4 self-hosted control plane: single
netbirdio/netbird-server:0.72.4 (management + signal + relay + STUN + embedded
Dex) plus netbirdio/dashboard:v2.39.0, both on the shared boma Docker network so
the M4a Caddy fronts them. Renders docker-compose.yml + config.yaml (secrets from
vault.netbird.*, no_log) + dashboard.env. STUN 3478/udp host-exposed; everything
else via the proxy. netbird_coordinator__manage gates the compose-up for Molecule.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 17:49:57 +02:00
19e675fa5a docs(friction): log registry-push auth gotcha (no creds in vault)
Building images is fully automatable; pushing to the Forgejo registry needs an
interactive docker login, and registry creds aren't in vault — so an agent can't
complete a push. Captured for the next kaizen review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 06:58:45 +02:00
b3468b34e4 docs: record Caddy/Gandi DNS-01 as resolved + proven (was M4a deferral)
ADR-024 Status/Consequences, STATUS.md, ROADMAP M4a, and the FRICTION ledger now
record that the DNS-01 path is built and proven, with the root cause of the M4a
failure (version skew: pre-Bearer libdns/gandi sent the deprecated Apikey header;
plus building on a Hetzner IP). Traefik was reconsidered and rejected again — lego's
Gandi provider has the same PAT-vs-Apikey question, so it would not have helped.

Dated review reports and spec/plan snapshots are left as historical records.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 06:57:55 +02:00
6e38693499 feat(reverse_proxy): optional ACME DNS-01 via Gandi (wildcard / LAN-only)
Adds a per-instance DNS-01 mode to the Caddy role for mesh/LAN-only hosts that
cannot satisfy HTTP-01. Default behaviour (vanilla caddy:2 + HTTP-01, what askari
runs) is unchanged.

  - reverse_proxy__acme_dns_provider: "" (HTTP-01) | "gandi" (DNS-01)
  - reverse_proxy__image: override to the custom caddy-gandi image for DNS-01
  - Caddyfile gains a global `acme_dns gandi {env.GANDI_BEARER_TOKEN}` block
  - the PAT (vault.gandi.pat) renders into a host-only 0600 env file (no_log),
    loaded by compose only when DNS-01 is enabled

Verified: the custom image issues a real wildcard cert (*.dns01test.wingu.me)
end-to-end against LE staging via Gandi DNS-01; `caddy validate` accepts
`acme_dns gandi` on the custom image and rejects it on vanilla caddy:2. Molecule
(HTTP-01 default path) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 06:57:47 +02:00
d407aeabb2 feat(docker): custom Caddy image with the Gandi DNS-01 plugin
Compiles caddy-dns/gandi v1.1.0 into Caddy v2.11.4 via xcaddy so mesh/LAN-only
hosts (no public A-record) can issue certs via ACME DNS-01. Pinned per ADR-011/014.

The M4a attempt failed for two reasons, both addressed here:
  - built on a Hetzner IP -> Google's Go module proxy 403s those ranges. The
    Makefile target is documented to build on ubongo, then push to Forgejo.
  - older libdns/gandi sent Gandi's deprecated Apikey header. v1.1.0 sends the
    PAT as Authorization: Bearer to api.gandi.net/v5/livedns.

make caddy-image / caddy-image-push mirror the molecule-image targets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 06:57:38 +02:00
293c1f88d8 docs(todo): collapse done items to one-line pointers; open-only convention
TODO had accreted multi-line DECIDED/DONE summaries duplicating the ADRs they
cite. Collapsed every done item to a one-line "~~task~~ -> ADR-NNN" pointer and
added an "open items only" convention note up top. Item numbers are stable
cross-references (ROADMAP/STATUS/ADRs/scripts cite them) so they are PRESERVED,
not renumbered — verified all externally-referenced numbers survive. 176->136 lines.
No new ledger: the record already lives in the ADRs / STATUS.md / FRICTION ledger.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 22:00:53 +02:00
13ae674cc9 chore(kaizen): first /kaizen run — curate 12 friction signals
Dogfood of the new /kaizen command. 11 consumed, 1 kept open.
- SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule
  tag-isolation testing, API/templating render-only gap); CLAUDE.md
  (item['key'] loop convention, TF module required_providers); public_dns
  README (Gandi null-MX workaround).
- CHANGE → extend the Stop hook to also guard the brainstorming spec-review gate
  (verified: blocks the gate, passes meta-discussion).
- SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder;
  ADR-004 documents the cross-role-naming convention.
- ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024;
  subagent-faithfulness now embodied in the two-stage subagent review.
- KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation.

Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:46:23 +02:00
d1e1e38879 feat(kaizen): nudge in /review-repo; STATUS + TODO
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:27:23 +02:00
8d2f564382 feat(kaizen): /kaizen command — interactive friction curation
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:26:21 +02:00
fd1e83a378 fix(kaizen): scope still_exists to repo paths; test age nudge; tidy --today
- Add REPO_DIRS constant; still_exists now only checks tokens that start
  with a known repo top-level dir, ignoring plugin names (caddy-dns/gandi),
  make command fragments (tf-init/plan), and role-relative paths.
- Add test_still_exists_ignores_non_repo_tokens (was failing before fix).
- Add test_nudge_line_overdue_on_age to close coverage gap on age threshold.
- Add load_signals docstring.
- Replace manual --today date parsing with datetime.date.fromisoformat type
  converter so malformed dates give a clean argparse error.
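
A minimal sketch of the type-converter approach, assuming a parser shaped roughly like the friction-scan CLI. The `build_parser` and `resolve_today` names and the fall-back to the current date are illustrative assumptions, not taken from the script.

```python
import argparse
import datetime


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="friction-scan")
    # argparse calls the converter on the raw string; a ValueError from
    # fromisoformat becomes a usage error (exit status 2), not a traceback.
    parser.add_argument(
        "--today",
        type=datetime.date.fromisoformat,
        default=None,
        help="override the current date (YYYY-MM-DD)",
    )
    return parser


def resolve_today(argv: list) -> datetime.date:
    args = build_parser().parse_args(argv)
    # Assumed behaviour: use the real date when --today is omitted.
    return args.today or datetime.date.today()


if __name__ == "__main__":
    print(resolve_today(["--today", "2026-06-14"]))
```

Passing the converter as `type=` lets argparse own the error path, so the script needs no hand-written date validation.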

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:25:03 +02:00
b185ac4765 feat(kaizen): friction-scan CLI (--json default, --nudge)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:18:16 +02:00
c6f66ee634 feat(kaizen): recurrence count + referenced-path existence
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:17:39 +02:00
72b9262f34 feat(kaizen): parse tag/first_seen/age per signal
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:17:03 +02:00
859732b04d feat(kaizen): friction-scan section extraction + signal split
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:16:36 +02:00
d14639e80a docs(plan): /kaizen command — implementation plan (TODO 11)
7 tasks: friction-scan.py (TDD, --json/--nudge) + tests; kaizen.md command;
/review-repo nudge hookup + STATUS/TODO; dogfood run. Mirrors /review-repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:09:29 +02:00
1a0e30e278 docs(spec): /kaizen — kaizen-loop command (TODO 11)
Curate-only consume pass over FRICTION.md Open signals: interactive guided
session, add/change/park/remove verdicts (park-with-resurrection-trigger to
protect out-of-phase tooling on a solo project), single source = FRICTION.md,
ledger is the durable record. Mirrors /review-repo (command md + stdlib scanner).
Stage 1 on-demand + stage-2 nudge; headless/cron deferred (TODO 11.3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:05:09 +02:00
e5867422d0 docs(todo): defer kaizen-loop automation to the notify + cron stack
Per brainstorm: ship the on-demand command + recurrence/age nudge first;
revisit a scheduled headless (report-only) run once ntfy + scheduled jobs exist.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 20:49:26 +02:00
f821006e9e docs(friction): log 2026-06-14 review+follow-up signals
Three new Open signals: ansible-lint no-role-prefix vs ADR-021/022 access__/
backup__ conventions (first service role); Molecule tag-propagation now testable
via tagged converge + full-then-partial; ADRs over-claiming cross-doc reconciliation
(repo-scan check candidate, cousin of stale-deferred).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 20:28:15 +02:00
9e0c264658 docs: reconcile lower-severity review findings (O9-O24)
- ADR-007: document ubongo on the legacy V4 net at 10.20.10.151 (transitional,
  outside the planned srv /24 until the LAN is re-cut) (O10); single authoritative
  boma.baobab.band -> boma.wingu.me transition note already added earlier
- terraform tfvars.example + variables.tf (both envs): pve01 -> pve0 and
  <host>.boma.baobab.band per ADR-007 naming (O11)
- ADR-012/013/015/016/017/018: convert "See also:" prose to `## Related` sections
  placed after Consequences, matching ADR-014/019-023 (O13)
- docs/README + inventories/README: list the missing subdirs / offsite_hosts +
  offsite.yml merge behaviour (O14, O29 note)
- ADR-009: drop the retired `nyumbani` example; use vaultwarden.wingu.me split-horizon (O19)
- ROADMAP M2: askari shipped as cx23/x86 (CAX11/ARM out of stock) (O20)
- ADR-020: 80/443/3478 opened in M4a (past tense); coordinator role is M4b (O21)
- netbird -> netbird_coordinator across ROADMAP M4b, the M4b plan, ADR-024 (O23)
- ADR-024: align the M1 DNS-01 wildcard scope wording with ROADMAP (O24)
- capacity-scan.py: read the inventory directory so offsite.yml (askari) is seen (O28)
- tf_to_inventory.py: generated header now warns it overwrites the manual control node (O9)
- tests/tags.yml: proxy concern comment Traefik -> Caddy (missed in the O3 sweep)

O9's existing stub hosts.yml header stays as-is (generator-owned, hook-protected);
the fix lives in the generator for the next regeneration. make lint + pytest (57) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:31:40 +02:00
9b5851ba4b chore(roles): role/test hygiene from review (O16,O17,O25,O26)
- dev_env .zshrc: drop the rclone alias (not installed) and guard the direnv
  hook with `command -v direnv` so a missing direnv doesn't error every shell (O16)
- dev_env oh-my-posh: tag the zen.toml theme deploy `config` (it renders config to
  disk like the per_user dotfiles); the include now carries packages+config so a
  `--tags config` run re-renders the theme while the binary install stays packages
  only (O17). Verified via `molecule converge -- --tags config`.
- drop the non-vocabulary `tags: [verify]` from molecule verify playbooks across
  base/docker_host/public_dns/reverse_proxy (check-tags exempts molecule anyway) (O25)
- reverse_proxy templates: add the `{{ ansible_managed }}` header (ADR-024 §1.2) (O26)

make lint green; dev_env + reverse_proxy molecule green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:31:23 +02:00
175777e36a docs: reconcile 2026-06-14 review findings (O1-O7,O18,O22)
- STATUS: docker_host is built+applied, not scaffold-only (O1)
- ADR-004: backup points to ADR-022, not "out of scope"; service-role file
  table gains ACCESS.md + BACKUP.md rows (O2, O5)
- Finish Traefik->Caddy: ADR-008/011/017/019, CAPABILITIES, TODO (O3); scope
  ADR-024's custom-image/NetBird claims to the deferred DNS-01/M4b paths (O22)
- ADR-016/017/018 now lead with ## Status per ADR-023 (O4)
- ADR-002: caveat `PLAYBOOK=upgrade` as planned/unbuilt (O6)
- CAPABILITIES: carve out ubongo's dev_env from the nvim/tmux exclusion (O7)
- ADR-007: one authoritative boma.baobab.band -> boma.wingu.me transition note (O18)
- new-host Part E: note ubongo is managed as sjat, ansible-user bootstrap pending (O15)

O9 (hosts.yml header) left open: the file is generator-owned (hook-protected);
fixing it needs a tf_to_inventory.py change or a tf-inventory run, not a hand-edit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:06:33 +02:00
cb8f924d4b docs(reverse_proxy): service-role SECURITY/VERIFY/ACCESS records (O12)
reverse_proxy is the first built+applied service role; add the per-service
records CLAUDE.md/ADR-002/008/017/021 require. Add access__*/backup__* data to
defaults as the source of truth (ADR-021/022). reverse_proxy is stateless (ACME
certs re-issue via HTTP-01), so it declares backup__state: false with a reason
rather than a BACKUP.md (ADR-022 convention).

The access__*/backup__* cross-role field names intentionally don't carry the
reverse_proxy__ prefix, so each is marked `# noqa: var-naming[no-role-prefix]`
(ansible-lint has no per-prefix allowlist; rule stays enabled elsewhere).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:06:23 +02:00
718781053f fix(dev_env): make concern tags reach included tasks (O8)
Tags on a dynamic include_tasks apply only to the include statement itself, not
to its (untagged) contents, so `--tags packages` ran none of the neovim/oh-my-posh/
nodejs installs, and `--tags users|config` never entered per_user.yml. Add
`apply: tags:` to all four includes (mirroring base/tasks/main.yml) and tag the
dev_env__home getent+set_fact preflight `always` so a partial run still resolves
the home dir before the dotfile/stow tasks consume it.

Molecule: add a config-only converge play for a fresh user + a verify assertion.
Proven with `molecule converge -- --tags config` (idempotent, home resolved).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 19:06:15 +02:00
64f1e821d8 docs(review): 2026-06-14 repo audit — M4a doc drift + Traefik→Caddy lag
11 safe auto-fixes (docs/comments only): reverse_proxy meta stale DNS-01
description, base/playbooks/scripts/terraform/public_dns README build-state,
CAPABILITIES reverse-proxy Traefik→Caddy, README ADR list → 024, TF cax11→cx23
stamps, public_dns wildcard DNS-01→HTTP-01 comment. 29 open findings reported.
make lint green. No stale-deferred (ADR-011 open questions still open).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:37:54 +02:00
e3461375f5 docs(plan): M4b — NetBird coordinator service role
Capture NetBird's configure.sh reference for a pinned version → translate into
boma role templates (compose + management.json + dex/openid + turnserver),
external-proxy mode behind the M4a Caddy (netbird.askari.wingu.me). First service
role: full ADR-004 standard files; secrets generated/CHANGEME-stubbed (setup key
for M5). Gated live deploy + verify.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:20:04 +02:00
1862b7a828 docs(m4a): HTTP-01 for askari; ADR-024 cert-method-follows-exposure; STATUS/roadmap/friction
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:14:38 +02:00
b7e919d6b3 refactor(reverse_proxy): vanilla Caddy + HTTP-01 (drop DNS-01 custom image)
Switch from a custom caddy-dns/gandi image built on-host to the official
caddy:2 image with per-host ACME HTTP-01 certificates. Removes the
Dockerfile, env.j2 (Gandi token), on-host image build/ship/load tasks,
the caddy-image Makefile target, and the wildcard DNS-01 Caddyfile.
Each route now gets its own server block and automatic certificate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:11:20 +02:00
9c169561d7 feat(offsite): *.askari.wingu.me wildcard + offsite.yml (docker_host + reverse_proxy)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:39:44 +02:00
1ee343dfca feat(tf): open Caddy 80/443 + NetBird 3478 on askari (public_web)
hetzner_vm gains a public_web bool (default false); offsite sets it true. Firewall
adds 80/443 tcp + 3478 udp from anywhere (SSH-from-ubongo preserved). For M4 Caddy
+ NetBird.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:38:51 +02:00
50b6445bdd feat(reverse_proxy): Caddy role (Gandi DNS-01, on-host image build, route catalog)
Implements the Caddy reverse proxy role (ADR-024): builds boma/caddy-gandi:latest
on-host (caddy-dns/gandi plugin), renders Caddyfile from route catalog, brings
Compose project up. Adds community.docker to requirements.yml, production group_vars,
and a caddy-image Makefile target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:36:58 +02:00
456c27d12b feat(docker_host): install Docker engine + compose plugin
Implements the docker_host role tasks: prerequisites, /etc/apt/keyrings
directory (ordered before the GPG key write), Docker APT key + repo, and
docker-ce/cli/containerd.io/compose-plugin install. Daemon hardening and
nftables.d integration remain deferred to Phase 2 (cluster + base firewall).
Updates defaults, README, and molecule verify to assert docker --version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:28:51 +02:00
d10f6de84b docs(adr): ADR-024 — Caddy is boma's reverse proxy
Adds ADR-024 pinning Caddy (xcaddy + caddy-dns/gandi) as boma's reverse
proxy, superseding the soft Traefik assumption in the roadmap and ADR-017
prose. Updates CLAUDE.md Further reading table and ROADMAP.md Phase-2 step 5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:28:42 +02:00
dd8c6825ba docs(plan): M4a — Docker + Caddy reverse proxy platform
First of M4's two build phases: docker_host (Docker engine), custom xcaddy Caddy
image (caddy-dns/gandi), reverse_proxy role (Caddyfile from a route catalog,
DNS-01 wildcard cert for *.askari.wingu.me via vault.gandi.pat), ADR-024 (Caddy is
boma's reverse proxy), firewall 80/443 + DNS, proven by serving a test route over
TLS. M4b (NetBird) follows, reading NetBird's current self-host compose then.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:20:53 +02:00
65cf20a993 docs(spec): M4 — NetBird coordinator on askari + Caddy reverse proxy
Caddy becomes boma's standard reverse proxy (amends the soft Traefik assumption;
new ADR) with Gandi DNS-01 certs (custom xcaddy image, reuses vault.gandi.pat) —
the only cert path for mesh/LAN-only services. NetBird self-hosted in
external-proxy mode (embedded Dex), compose rendered from boma templates
(ADR-004/013). Three roles: docker_host (first real content), reverse_proxy (new,
Caddy), netbird (first service role w/ full ADR-004 standard files). Firewall +
DNS amendments; backup execution deferred (fisi). caddy-dns/gandi + NetBird
self-host facts verified.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 17:19:21 +02:00
181a02fd3a docs(friction): include_tasks tag-propagation + check-mode gotchas (M3)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:56:23 +02:00
9d787a4f53 docs(base): M3 done — ssh hardening + fail2ban applied to askari; STATUS + roadmap
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:55:22 +02:00
db1e5db138 fix(base): propagate hardening tag to included tasks; check-mode-safe fail2ban
Two bugs caught by the live make check/deploy on askari:
- include_tasks with a tag selects the include but NOT its tasks, so --tags hardening
  ran nothing. Use apply: {tags:} to propagate (also fixed the firewall include).
- fail2ban service start + restart handler fail in a first-run --check (package not
  installed yet); guard both with when: not ansible_check_mode so check is clean.
Applied to askari: SSH hardened, fail2ban active, ping still works (no lockout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:54:23 +02:00
a111a20cc8 test(base): Molecule coverage for ssh hardening + fail2ban
Add explicit base__ssh_authorised_keys: [] default to prevent
undefined-var errors in Molecule. Extend verify.yml with sshd
drop-in validation, PasswordAuthentication check, and fail2ban
jail assertion. Pre-create /run/sshd in ssh.yml so sshd -t
works in containers before the service has ever started.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:47:42 +02:00
deec75de0f feat(base): ssh hardening + fail2ban (hardening concern, ADR-002)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:42:56 +02:00
22021210c4 feat(make): optional LIMIT= and TAGS= passthrough on check/deploy
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:41:59 +02:00
cff368ece2 docs(spec,plan): M3 — base ssh hardening + fail2ban
ADR-002 baseline (key-only, no root, fail2ban 5/1h) as two base task files under
the existing 'hardening' concern tag; applied to askari by tag (NOT the host
firewall — that's mesh-gated to avoid lockout; Hetzner Cloud Firewall is the
perimeter until M5). NetBird agent deferred to M4. Adds a LIMIT=/TAGS= passthrough
to make check/deploy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:38:38 +02:00
a1c0f4814b feat(askari): publish askari.wingu.me; mark M2 applied (askari live)
askari provisioned + bootstrapped (cx23/hel1/Debian 13.5, cloud-init ansible user
+ sudo, cloud firewall SSH-from-ubongo-WAN, reachable, in offsite_hosts). Added
askari.wingu.me A -> 77.42.120.136 via public_dns. STATUS: askari moves to
'Real and working today'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:26:26 +02:00
917005174a feat(tf): provision askari — cx23/hel1 (CAX11 ARM was out of stock)
ARM (cax11) unavailable in all EU locations 2026-06-14; fell back to cx23 (x86,
same 2/4/40 spec, cheaper in hel1). Server created (id 141153963); offsite.yml
generated into the directory inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:23:01 +02:00
e83c777b44 docs(friction): TF child-module required_providers gotcha (caught by live init)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:15:23 +02:00
839fc632a1 fix(tf): declare required_providers in modules; pin offsite lock
terraform init failed: child modules using non-hashicorp providers must declare
required_providers, else TF infers hashicorp/{hcloud,proxmox} (nonexistent). Add
versions.tf to hetzner_vm AND proxmox_vm (same latent bug, never caught because
Proxmox TF was never init'd). Track the offsite lock (hcloud 1.65.0). Caught by
running 'make tf-init/plan TF_ENV=offsite' on ubongo — static review missed it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:14:05 +02:00
9d4a49d49d feat(vault): CHANGEME placeholder convention + check-vault flags them
Streamline the recurring secret-entry friction: the agent stubs a needed secret as
vault.<service>.<key>=CHANGEME with a what/how-to-obtain comment, wires the code,
and commits; the operator fills it via make edit-vault (real value never hits chat).
check-vault now lists outstanding CHANGEME placeholders so none are forgotten.
Convention documented in CLAUDE.md.
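
A minimal sketch of how the CHANGEME listing could work, assuming the vault has
already been decrypted and parsed into a nested dict. This is not the actual
scripts/check-vault.py; `find_placeholders` and the `netbird.setup_key` entry are
hypothetical names used only for illustration.

```python
# Sketch only: walk a nested vault mapping and report dotted paths whose
# value is still the CHANGEME placeholder. Names here are illustrative.

def find_placeholders(node, path="vault", marker="CHANGEME"):
    """Return dotted paths of every leaf that equals the placeholder marker."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found += find_placeholders(value, f"{path}.{key}", marker)
    elif node == marker:
        found.append(path)
    return found


# Sample data: only the shape (vault.<service>.<key>) comes from the convention.
vault = {
    "gandi": {"pat": "already-filled"},
    "netbird": {"setup_key": "CHANGEME"},
}

for dotted_path in find_placeholders(vault):
    print(f"outstanding placeholder: {dotted_path}")
```

Reporting paths rather than values keeps the listing safe to print, in line with
the rule that real secret values never reach the chat.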

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 15:40:37 +02:00
09b0aad342 fix(tf): cloud-init heredoc column-0 + firewall uses ubongo's WAN IP
Review catches: (1) <<-EOT strips the smallest leading indent found among the body
lines, so '#cloud-config' must sit at that minimum indent (2 spaces here) to land
at column 0; (2) the Hetzner Cloud Firewall filters public traffic, so
ssh_admin_cidrs is ubongo's WAN/egress IP, not its LAN address; a private CIDR
would lock SSH out of the live VPS.
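
The second catch lends itself to a mechanical guard. A minimal sketch, assuming
the admin CIDRs are available as plain strings; `private_cidrs` is a hypothetical
helper, and 8.8.8.8/32 is only a stand-in for a public address (ubongo's real WAN
IP is not recorded here).

```python
# Sketch only: flag CIDRs that a public-facing cloud firewall can never match,
# because traffic arriving from the internet never carries a private source.
import ipaddress


def private_cidrs(cidrs):
    """Return the entries that fall inside private/reserved address space."""
    return [
        cidr
        for cidr in cidrs
        if ipaddress.ip_network(cidr, strict=False).is_private
    ]


# 10.20.10.151 is ubongo's LAN address; 8.8.8.8 is a placeholder public address.
candidates = ["10.20.10.151/32", "8.8.8.8/32"]
for cidr in private_cidrs(candidates):
    print(f"{cidr}: private source, would lock SSH out behind a public firewall")
```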

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:19:45 +02:00
3588904528 docs(askari): amend ADR-006/009/020/007/016 for TF-provisioned offsite host; STATUS (apply pending)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:09:20 +02:00
fd86ec6848 test(tf): lock the offsite_hosts inventory handoff
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:06:26 +02:00
07af037ff3 feat(make): offsite TF token injection + directory inventory + tf-inventory-offsite
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:05:41 +02:00
127ade59a3 feat(tf): offsite environment — askari (CAX11/hel1/debian-13)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:03:31 +02:00
bbc287900a feat(tf): hetzner_vm module (server + firewall + ssh key + cloud-init)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 12:03:01 +02:00
29921428c4 docs(plan): M2 — askari provisioning (Terraform + Hetzner Cloud)
9-task plan: verify hcloud facts; hetzner_vm module (server+firewall+ssh+cloud-init);
offsite env (CAX11/hel1/debian-13, local state); Makefile token-injection + directory
inventory + tf-inventory-offsite; offsite-handoff pytest; init/validate/plan; GATED
apply (billed VPS) + bootstrap; ADR-006/009/020/007/016 amendments. Resolves the
inventory-handoff open item via a directory inventory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 11:53:08 +02:00
993d7885e4 docs: mark M1 applied (STATUS); log item.values + Gandi null-MX gotchas
M1 public_dns applied to wingu.me (purge + SPF/DMARC, idempotent). Friction:
item.values dict-method collision, Gandi null-MX rejection, and the apply=false-
Molecule/data-only-pytest gap that let both bugs reach a live apply.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:58:03 +02:00
76bd1d63fc fix(public_dns): index loop keys with item['key'] not item.key
item.values resolved to the dict's built-in .values() METHOD, not the 'values'
key, so gandi_livedns received '<built-in method values of dict object at 0x..>'
as the TXT value — garbage AND non-idempotent (the address changes each run).
Bracket-index all loop fields. Caught only by the live apply (apply=false Molecule
+ data-only pytest both missed it).
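
The collision can be reproduced outside Ansible. A minimal sketch, assuming
Jinja2's documented lookup order (dotted access tries the attribute first,
bracket access tries the item first); `dotted` and `bracket` are hypothetical
helpers that emulate that order, and the sample record is illustrative.

```python
# Sketch only: emulate the lookup order behind `item.values` vs `item['values']`.

def dotted(obj, name):
    """Template-style `obj.name`: attribute first, then item."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


def bracket(obj, name):
    """Template-style `obj['name']`: item first, then attribute."""
    try:
        return obj[name]
    except (KeyError, TypeError):
        return getattr(obj, name)


item = {"name": "@", "type": "TXT", "values": ['"v=spf1 -all"']}

# Dotted access finds dict.values (a bound method) before the 'values' key.
print(callable(dotted(item, "values")))
# Bracket access returns the list stored under the 'values' key.
print(bracket(item, "values"))
# Keys that do not shadow a dict attribute behave the same either way.
print(dotted(item, "type") == bracket(item, "type"))
```

The method's string form embeds a memory address, which is why the rendered TXT
value also changed between runs.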

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:57:23 +02:00
078d1ad9d9 fix(public_dns): drop null-MX (Gandi rejects '0 .'); remove MX instead
Gandi LiveDNS rejects the RFC-7505 null-MX value '0 .' ('invalid format for MX
record'), which failed the live apply. No MX + no apex A = no mail delivery, and
SPF -all + DMARC reject still prevent spoofing — so remove Gandi's seeded MX (add
@/MX to absent) rather than declare a null-MX present. Assert now requires an SPF
@/TXT record; tests + Molecule sample updated.
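
A minimal sketch of the kind of data-level check this implies, assuming records
are dicts with `name`, `type` and `values` fields split into present/absent
lists. Those field names and `check_no_mail_zone` are assumptions for
illustration, not the role's actual assertion.

```python
# Sketch only: validate a "sends no mail" zone baseline on plain record data.

def check_no_mail_zone(present, absent):
    """Return a list of problems; an empty list means the baseline holds."""
    problems = []
    has_spf = any(
        rec["name"] == "@"
        and rec["type"] == "TXT"
        and any(v.strip('"').startswith("v=spf1") for v in rec["values"])
        for rec in present
    )
    if not has_spf:
        problems.append("missing SPF @/TXT record")
    if any(rec["type"] == "MX" for rec in present):
        problems.append("MX declared present (null-MX '0 .' is rejected)")
    if not any(rec["name"] == "@" and rec["type"] == "MX" for rec in absent):
        problems.append("@/MX not listed as absent")
    return problems


present = [
    {"name": "@", "type": "TXT", "values": ['"v=spf1 -all"']},
    {"name": "_dmarc", "type": "TXT", "values": ['"v=DMARC1; p=reject"']},
]
absent = [{"name": "@", "type": "MX"}]

print(check_no_mail_zone(present, absent) or "baseline ok")
```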

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:53:54 +02:00
3cb6436ad2 docs(adr-007): fix askari FQDN to askari.wingu.me (review nit)
The naming-table amendment left the 'External monitoring' prose saying
askari.baobab.band; askari is greenfield (never on baobab.band), so its FQDN is
askari.wingu.me, off-site tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:44:21 +02:00
f170ffd936 docs(public_dns): amend ADR-007 to wingu.me/Gandi; resolve TODO 4; STATUS + CAPABILITIES
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:38:45 +02:00
e247af6e55 test(public_dns): Molecule scenario (apply disabled, no live API)
Converge runs in CI; the no-op apply=false scenario adds no local signal over
the pytest, and the test image is on an unreachable registry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:36:40 +02:00
a0a3e4d356 feat(public_dns): dns.yml play (control-node, Gandi LiveDNS)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:35:30 +02:00
bd84dd0213 feat(public_dns): role tasks, defaults, meta, README
Implement M1: manage wingu.me public DNS zone at Gandi LiveDNS via
community.general.gandi_livedns (PAT from vault.gandi.pat). Adds
assertion guard for domain + null-MX, present/absent record loops
with run_once, and apply-gate for Molecule dry-run mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:34:42 +02:00
9311968363 feat(public_dns): wingu.me record data + validation test
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:33:07 +02:00
91ad629c02 secrets(vault): rotate Gandi PAT (via make edit-vault)
The chat-exposed PAT was rotated at Gandi and swapped in via the new edit-vault
target; commit the re-encrypted vault so the rotation is versioned.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:30:58 +02:00
70c302d7e5 scaffold(public_dns): empty role structure
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:30:02 +02:00
6f5c7b2bfb deps: add community.general for gandi_livedns (public_dns)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:29:57 +02:00
e96480692d docs(friction): execution-mode menu recurred despite the 06-10 mechanical fix
5th occurrence (06-14): the subagent-driven/inline menu was offered again at the
M1 plan handoff. The 06-10 ledger claims a Stop hook blocks this; it didn't fire.
Flag to verify the hook is present and its matcher catches the writing-plans menu
wording.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:26:43 +02:00
b131ee317e docs(plan): M1 — public_dns implementation plan
Bite-sized TDD plan: add community.general; scaffold public_dns; wingu.me record
data + pytest; role tasks (gandi_livedns present/absent loops, apply toggle);
Molecule (apply=false, no live API); dns.yml play; gated live run on ubongo
(purge Gandi defaults + anti-spoof baseline + dig verify); ADR-007 amendment +
TODO 4 resolution + STATUS/CAPABILITIES.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:23:26 +02:00
602550fdaa docs(spec): M2 — provision askari via Terraform + Hetzner Cloud
askari is provisioned as IaC: Terraform owns its existence too, generalizing
ADR-006 from "Proxmox VM existence" to Proxmox + Hetzner (new hetznercloud/hcloud
provider, hetzner_vm module, offsite stack with local state). CAX11 (ARM) in
Helsinki on Debian 13, behind a TF-managed Hetzner Cloud Firewall (SSH-from-ubongo
now; NetBird ports in M4). Token via TF_VAR_hcloud_token from vault.hetzner.token.
Handoff stays ADR-009-shaped (tf_to_inventory.py extended to emit askari into
offsite_hosts). State in the ADR-022 backup scope; DR via terraform import.

Amends ADR-006/009/020/007/016. Point ROADMAP.md M2 at the spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 10:12:10 +02:00
32d480efcf docs(spec): note project (boma) vs domain (wingu.me) in the naming scheme
Decided to keep the project named boma with wingu.me as its domain (boma was not
available as a domain). Record why the infra tier reads <host>.boma.wingu.me so it
isn't re-litigated; folds into the ADR-007 amendment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:47:13 +02:00
79f2315eee feat(make): add edit-vault + check-vault targets
`make edit-vault` runs `ansible-vault edit` (decrypt → nvim → re-encrypt on :wq,
abort on :cq) so editing the vault is one step with no plaintext left in the work
tree, then validates structure. `make check-vault` runs scripts/check-vault.py:
decrypts in-memory, asserts valid YAML with secrets under the nested `vault:` map
and no empty leaves, and prints a values-masked structure view (comments visible,
secrets never printed). Both default to the production all-vault; override VAULT=.
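
A minimal sketch of the two checks described above, assuming the decrypted YAML
has already been parsed into a nested dict. It is not the actual
scripts/check-vault.py; `masked_view` and `empty_leaves` are hypothetical names,
and comment passthrough is left out because a parsed dict carries no comments.

```python
# Sketch only: print vault structure with values masked, and find empty leaves.

def masked_view(node, indent=0):
    """Return the key structure as lines, with every leaf value masked."""
    lines = []
    pad = "  " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines += masked_view(value, indent + 1)
        else:
            lines.append(f"{pad}{key}: ********")
    return lines


def empty_leaves(node, path="vault"):
    """Return dotted paths of leaves that are None, empty strings or empty maps."""
    bad = []
    for key, value in node.items():
        here = f"{path}.{key}"
        if isinstance(value, dict) and value:
            bad += empty_leaves(value, here)
        elif value in (None, "", {}):
            bad.append(here)
    return bad


# Sample data: the values are placeholders, only the nesting matters.
data = {"vault": {"gandi": {"pat": "placeholder"}, "hetzner": {"token": ""}}}

print("\n".join(masked_view(data)))
print("empty:", empty_leaves(data["vault"]))
```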

Update the vault header comment, CLAUDE.md (command table + Secrets section), and
scripts/README to point at edit-vault (note check-vault.py is the one venv-
dependent helper, by design).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:36:15 +02:00
43e5a4aa53 secrets(vault): add Gandi LiveDNS PAT as vault.gandi.pat
Personal Access Token for wingu.me LiveDNS, used by the M1 public_dns role via
community.general.gandi_livedns. Stored under the nested vault.<service>.<key> map
(CLAUDE.md); the placeholder canary is preserved. Verified the token authenticates
+ is scoped to wingu.me, and that the file round-trips (decrypts to the expected
structure). PAT to be rotated after M1 (transmitted in plaintext during setup).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:14:10 +02:00
f7fac5f5e3 docs(spec): M1 — finalize for wingu.me (greenfield), record Gandi-defaults purge
boma's domain is wingu.me (registered at Gandi; 'wingu' = Swahili for cloud).
Replace the parametric <boma-domain> placeholder with wingu.me throughout. The
zone was NOT empty — Gandi auto-seeded 13 default records (parking A, www redirect,
a full Gandi mailbox set), so M1 includes a one-time purge to a clean baseline plus
an anti-spoof null-mail set (null MX, SPF -all, DMARC reject) since wingu.me sends
no mail. Domain-pick open item closed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 09:14:10 +02:00
7a47dd9dec docs(spec): M1 — public DNS migration to Gandi (DNS-as-code) design
Settles the M1 design: full registrar transfer Cloudflare -> Gandi; three-tier
naming scheme (host.boma / service.bare / service.askari), nyumbani dropped,
mesh/LAN-only default; public-DNS-as-code via a control-node `public_dns` role
driven by group_vars data, using community.general.gandi_livedns with a PAT
(api_key is deprecated/rejected by Gandi — verified per ADR-014). Stale records +
unused MX cleaned by omission. Cert scope is DNS+PAT only (issuance deferred to
M4/Phase 2). Human/agent division of labour + token-scoping recorded.

Resolves TODO 4 and review finding O12 once the ADR-007 amendment lands. Point
ROADMAP.md M1 at the spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 23:17:19 +02:00
be2679cc66 docs(roadmap): record decided DNS naming scheme in M1
Three-tier scheme: <host>.boma.baobab.band (infra, internal) /
<service>.baobab.band (home, split-horizon, mesh/LAN-only default) /
<service>.askari.baobab.band (off-site, public). nyumbani dropped; mesh carries
the baobab.band match-domain to road-warriors; *.baobab.band DNS-01 wildcard
certs via Gandi API. Resolves TODO 4 and review finding O12.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 22:17:28 +02:00
3cfcb1c2e9 docs(roadmap): add ROADMAP.md — remote-access-first build order
High-level build order for the project (Approach A): one Off-site/Remote-access
track first (Gandi DNS-as-code -> askari -> NetBird control plane -> enroll
ubongo + road-warrior laptops -> harden), a procurement gate sized by
/capacity-review, then the Cluster track. Sequences the docs/TODO.md backlog into
milestones and records why the order is what it is.

Decisions captured this session: Gandi over Cloudflare is values-driven and
independent of NetBird (sequenced first so records are born at Gandi); public DNS
managed as code (Ansible, consistent with internal DNS + Terraform-owns-no-DNS);
NetBird-on-ubongo before base default-deny (chicken-and-egg); cluster procurement
gated on patterns proven on two cheap hosts.

Wire ROADMAP.md into CLAUDE.md's Further-reading index and point TODO.md at it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 22:12:38 +02:00
03d33f83dd fix(O1): scaffold docker_host role so make lint passes on main
playbooks/site.yml imports the docker_host role, but it didn't exist, so
ansible-lint's syntax-check failed on a clean checkout — breaking CLAUDE.md's
"main must always work" / "Never skip lint" (top open finding O1 from the
2026-06-11 review).

Scaffold docker_host as a proper placeholder via the prescribed mechanism
(make new-role): filled meta/main.yml + README, an honest no-task tasks/main.yml
documenting planned scope (Docker engine + Compose, daemon hardening, nftables.d
container rules per ADR-004/020), and the standard molecule scenario. This
preserves site.yml's full-standard-state intent rather than dropping the play.

Update STATUS.md (docker_host moves from "Not in git" to "scaffolded, no tasks")
and the role/playbook READMEs to match.

make lint: 0 failures, 0 warnings; check-tags OK.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 14:53:55 +02:00
1da117d65b docs(review): 2026-06-11 repo audit — fix build-wave doc drift
/review-repo run at 67f2aba. Auto-fixed 5 safe doc-drift items left by the
base(firewall)+dev_env build wave: README/playbook/role notes that still called
the roles "empty/not built", plus README tree gaps and the reciprocal ADR-021
cross-links in ADR-016/020.

18 open findings reported (not fixed). Headline: `make lint` is red on `main`
(site.yml imports the non-existent docker_host role) and an ADR-004 <-> ADR-022
backup-scope contradiction. Deferral checklist clean (0 stale-deferred); 7 of
12 prior findings confirmed resolved. See docs/reviews/2026-06-11-review.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 14:48:00 +02:00
67f2aba9d8 STATUS: record dev_env (built+applied) and working deploy path
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 14:21:36 +02:00
aea4f8c3d6 dev_env: install Node.js from pinned tarball, drop npm bloat
Debian's npm package pulls a ~400-package node-* tree (the first deploy
installed 527 packages). Replace apt nodejs+npm with a pinned upstream Node
tarball (v20.19.2) installed to /opt + symlinked, mirroring the nvim install
pattern (ADR-014 pinning). npm/npx come bundled. Molecule verifies node/npm
on PATH; lint + idempotent converge green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 14:21:33 +02:00
6203513220 inventory: manage ubongo (control node) as the operator account
group_vars/all assumes the ansible service user (created by bootstrap on
Terraform VMs). ubongo is the manually-provisioned control node (ADR-009/
ADR-015 exception) with no bootstrapped ansible user, so connect as sjat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 14:09:15 +02:00
607423d0e7 dev_env: install acl for become_user file copies
When the login user differs from the become_user (ubongo connects as sjat,
the role copies files as claude), Ansible needs ACLs on its temp files;
without the acl package it falls back to an unsupported chmod syntax and
fails. Molecule didn't catch it (root login can chown directly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 14:09:12 +02:00
a2bb99928c fix(deploy): make check/deploy actually run
Two latent bugs that blocked the documented deploy path (never exercised
end-to-end before applying dev_env to ubongo):
- Makefile: the PLAYBOOK variable was both the ansible-playbook BINARY path
  and the user-supplied playbook NAME, so `make check/deploy PLAYBOOK=<name>`
  overrode the binary. Renamed the binary var to PLAYBOOK_BIN.
- ansible.cfg: stdout_callback=yaml and callbacks_enabled=timer were
  community.general plugins (not installed; boma only ships ansible.posix).
  Use the built-in default callback with callback_result_format=yaml and
  ansible.posix.profile_tasks — same intent, no new heavy collection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 14:09:12 +02:00
f3f382ae69 Add dev_env role: zsh/tmux/nvim for workstation-class hosts
A new role (separate from base) that gives workstation-class hosts (ubongo
now, mamba later) a clean interactive environment: zsh + oh-my-zsh +
oh-my-posh, tmux + TPM plugins, and neovim. Dotfiles are real files deployed
via GNU stow (not templated); pinned nvim v0.12.2 + oh-my-posh 29.0.1.

Configs re-derived (ADR-013) from AnsibleBaobabV4 + the operator's fisi setup
on boma's terms: no Nerd Font (headless host), no system LSP suite (nvim uses
mason), versions pinned (V4 tracks latest). Applied via playbooks/workstation.yml
to the control group for users sjat + claude. Lint + Molecule (idempotent) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 13:50:11 +02:00
b9daf2a0ad plan: record ubongo build outcome (done/deferred/follow-ups)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 10:33:18 +02:00
349d10d65c docs: record ubongo physical build (2026-06-11)
Move ubongo to 'Built (partial)' in STATUS; fill real M70q hardware specs
(i3-10100T, 16 GB, 256 GB SanDisk X600 SATA, no disk encryption). Record in
ADR-015 the dedicated claude AI-worker identity, LAN-SSH-only operational
reality, and the no-encryption decision; close the rbw offline-cache
recovery-verification item (ADR-015 + rotate-secrets). Add accepted-risk R5
(control-node disk unencrypted at rest) with its compensating controls.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 10:32:26 +02:00
7b5fd17e55 inventory: add ubongo to control group; set ssh-from-control addr
Wire the now-built physical control node ubongo (10.20.10.151) into the
production control group (the documented manual exception), and activate the
dormant base__firewall_control_addr knob (ADR-021 ssh-from-control source).
Forward-wiring only: no host has the base role applied yet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 10:32:24 +02:00
7b190e4313 Add ubongo physical-build plan (2026-06-11 session)
Captures the interactive build decisions (no-encryption + accepted risk,
simple partition, dedicated claude identity, LAN-only access, pinned
versions) and the A-F + H task breakdown. Sequel to the 2026-06-05
docs-only ADR-015 plan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 10:01:41 +02:00
7ebbc113ab Merge feat/adr-structure: ADR-023 structure & lifecycle + back-catalogue conformance
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 15:18:48 +02:00
fa3db421dc docs(kaizen): FRICTION signal — controller must diff-audit subagent restructures
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 15:01:21 +02:00
d0a3307822 docs(adr): fix 007/008 heading nesting; require date in Superseded status
Final-review polish: demote the sub-headings under the demoted 'IP addressing'
(007) and 'Three testing levels'/'What Molecule tests' (008) to #### so they
nest correctly instead of flattening to siblings. Tighten the adr-structure
Superseded pattern to require '(YYYY-MM-DD)' per ADR-023.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 15:00:58 +02:00
0df24909e3 docs(adr): restructure ADRs 016-018 to ADR-023 conformance
Make the existing Status sections parseable (Accepted (date) + the existing
designed-not-built note) and add Consequences sections assembled from each
ADR's already-stated residual risks, trade-offs and build status. No
decision substance changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:51:51 +02:00
40a428975a docs(adr): restructure ADR-003 to ADR-023 conformance
Add Status, a descriptive Context, a Decision umbrella over the existing
topical sections (demoted to ###), and a Consequences section assembled
from the ADR's already-stated rationale. No decision substance changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:50:03 +02:00
6d7d27b03b docs(adr): add Proposed lifecycle state; mark ADR-011 Proposed
Revisits the lifecycle decision on the evidence of ADR-011 (a real draft
with open questions). Adds a fourth state, Proposed (YYYY-MM-DD), to ADR-023,
the template, the adr-structure check (+test), spec and plan. Sets ADR-011's
Status to Proposed and removes its now-redundant inline 'Proposed' line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:48:55 +02:00
b3ca510380 docs(adr): restructure ADRs 010,011,013 to ADR-023 conformance
010/011: relabel Decisions->Decision + add Status/Consequences.
013: add Status + Decision umbrella (existing Consequences untouched).
No decision substance changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:43:41 +02:00
44dbd4628f docs(adr): restructure ADRs 006-009 to ADR-023 conformance
Add dated Status sections, a Decision umbrella over the existing topical
sections (demoted to ###), and Consequences assembled from each ADR's
already-stated implications. No decision substance changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:41:24 +02:00
188882449d docs(adr): restructure ADRs 001,002,004,005,012,014,015 to ADR-023 conformance
Add dated Status sections and (where missing) Consequences sections assembled
from each ADR's already-stated implications. No decision substance changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:39:00 +02:00
9b1502cf7d docs(adr): register ADR-023 and note adr-structure check
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:33:55 +02:00
a9aab9d040 docs(adr): ADR-023 — ADR structure & lifecycle
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:32:40 +02:00
3c920ae630 docs(adr): sync plan Task 2 with flat-comment template fix
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:31:23 +02:00
ab14d65aa1 docs(adr): add adr-template.md scaffold (ADR-023)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:30:52 +02:00
89179dd7c9 docs(adr): revise spec+plan — full retroactive restructure of 001-018
Replaces the Status-only backfill with a faithful presentational
restructure bringing the whole back-catalogue to 4-section conformance
(no grandfathering). Adds the faithfulness rule and per-file worklist.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 14:28:20 +02:00
a3ea0f7d80 feat(review): add adr-structure check to repo-scan
Flags numbered ADRs missing a mandatory section (Status/Context/Decision/
Consequences) or with an unparseable Status line. Presence only, not order.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 13:57:42 +02:00
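The check described in this commit can be sketched roughly as follows. This is an illustrative reconstruction, not the code in `repo-scan.py`: the `## ` heading level, the function name `adr_structure_problems`, and the exact Status-line pattern are assumptions, and the `Proposed` state is included only because a later commit in this range adds it.

```python
import re

# Section names come from the commit message; the "## " heading level and the
# Status-line shape "<State> ... (YYYY-MM-DD)" are assumptions for this sketch.
MANDATORY_SECTIONS = ("Status", "Context", "Decision", "Consequences")
STATUS_LINE = re.compile(
    r"^(Accepted|Proposed|Superseded|Deprecated)\b.*\(\d{4}-\d{2}-\d{2}\)"
)


def adr_structure_problems(text: str) -> list:
    """Return problems for one ADR's markdown text (presence only, not order)."""
    problems = []
    lines = text.splitlines()
    headings = {}
    for index, line in enumerate(lines):
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match:
            # Remember only the first occurrence of each heading.
            headings.setdefault(match.group(1), index)
    for section in MANDATORY_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")
    if "Status" in headings:
        # The first non-blank line after the Status heading must parse.
        body = [l.strip() for l in lines[headings["Status"] + 1:] if l.strip()]
        if not body or not STATUS_LINE.match(body[0]):
            problems.append("unparseable Status line")
    return problems
```

Because the function only looks sections up by name, an ADR that lists `Consequences` before `Decision` still passes, which matches the "presence only, not order" wording.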
ce3319cbed docs(adr): implementation plan + FRICTION signal for ADR structure
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 13:55:16 +02:00
dfbe37916f docs(adr): design spec for ADR structure & lifecycle (ADR-023)
Codifies the structure ADRs 019-022 converged on, pins an
Accepted/Superseded/Deprecated lifecycle with a no-silent-rewrite rule,
adds an adr-template.md scaffold, and plans a Status-header backfill of
ADRs 001-018. Basis for ADR-023.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 13:45:21 +02:00
4116286ed0 feat(hooks): Stop guard blocking the execution-mode menu
Mechanical fix for the 4×-recurring execution-mode menu ask (kaizen 2026-06-10).
A Stop hook reads the transcript and, if the final assistant message presents the
"subagent-driven vs inline — which approach?" menu, blocks the turn and tells the
model to proceed subagent-driven (boma's standing preference). Fails open,
respects stop_hook_active (no loop), tight match signature (no false positives on
meta-discussion). Pipe-tested across 5 scenarios. Activates next session
(settings watcher only tracks files present at session start).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 12:51:46 +02:00
91713127cb docs(kaizen): migrate gotchas to docs; curate FRICTION log (2026-06-10 review)
- New docs/testing/gotchas.md (nft iif/iifname, Molecule ansible_host,
  apply-path coverage blind spot, render-nft-c pattern); pointer from ADR-008.
- claude-code-setup.md gains "Environment gotchas" (hooks-need-restart,
  pre-commit stashes unstaged, rbw sync cache, zsh word-split).
- FRICTION.md restructured into Open signals + a decisions ledger; consumed
  signals archived with where their resolution now lives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 12:51:39 +02:00
2dbcac11a0 chore(tooling): scope ansible-lint to ansible content; venv PATH in make test
Kaizen 2026-06-10 fixes:
- ansible-lint pre-commit hook now `always_run: false` + a files filter for
  roles/playbooks/inventories YAML, so docs-/config-only commits skip it and no
  longer need `rbw unlock` (root cause was ansible-lint auto-decrypting the
  group_vars vault, not the syntax-check).
- `make test`/`test-all` prepend $(CURDIR)/.venv/bin to PATH so non-activated
  agent runs find ansible-config/ansible-playbook.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 12:51:30 +02:00
9be4366ac3 feat(backup): backup strategy foundation layer (ADR-022)
Plan 1 of the backup & DR strategy: ADR-022, per-service backup__* contract +
BACKUP.md governance (template + checklist gate + new-role runbook step + dormant
/check-backup), and hardware/CAPABILITIES updates. Docs-only; the backup role and
live restore testing are Plans 2-3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:32:36 +02:00
ed6d5463aa docs(backup): final-review fixes — stateless BACKUP.md, dump-step wording, spec sync
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:32:06 +02:00
1e85c11ede docs(backup): update hardware ref (ubongo M70q, add fisi) + CAPABILITIES §9 (ADR-022)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-10 11:25:37 +02:00
5f946ac640 feat(backup): add dormant /check-backup verifier (ADR-022) 2026-06-10 11:22:57 +02:00
01e47d0890 docs(backup): add BACKUP.md step to new-role runbook (ADR-022)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:21:56 +02:00
81dac4f28b docs(backup): gate BACKUP.md in service checklist (ADR-022) 2026-06-10 11:20:55 +02:00
f3f80443d0 docs(backup): add BACKUP.md template + backup__* contract (ADR-022)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:20:01 +02:00
f5c97d1f36 docs(backup): record ADR-022; wire into CLAUDE.md, STATUS, TODO
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-10 11:19:01 +02:00
da116e1d92 docs(friction): log execution-mode ask (4th occurrence)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:06:25 +02:00
2041bd3b70 docs(backup): add foundation-layer implementation plan (ADR-022)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:05:17 +02:00
eaffd8d900 docs(backup): add backup & DR strategy design (→ ADR-022)
Data-only restic backups, rebuild-from-code recovery (Model A); central
off-cluster pull node (fisi) with 8TB mirror; 3-2-1 via pCloud (rclone)
+ rotated USB air-gap. Per-service backup__* contract + BACKUP.md as a
hard convention. Two-tier restore testing (ubongo container restore-verify
+ semi-annual staging DR rehearsal). One restic password escrowed to
Vaultwarden + paper (restic + vault passwords) for a non-circular
break-glass. Dead-man's-switch alerting via Uptime Kuma.

Resolves TODO 3.8; grounds ADR-011's backup-first assumption.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 11:00:01 +02:00
032adf1525 docs(friction): log execution-mode recurrence; fix list de-indents
Complete the 2026-06-09 entry (third recurrence of presenting the
execution-mode menu despite the standing subagent-driven preference) and
restore two continuation-line indents a markdown formatter had stripped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-10 08:54:37 +02:00
f151e99d04 docs(access): correct ADR-021 governance (runbook+gate, not scaffold)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:52:24 +02:00
13f0d482bd docs(access): wire ADR-021 into CLAUDE.md, STATUS, TODO
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:48:31 +02:00
649925b303 docs(access): gate ACCESS.md in checklist + new-role runbook (ADR-021)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:46:51 +02:00
384b94e34b feat(access): add /check-access verifier command (ADR-021, dormant)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:45:24 +02:00
0c507bbace feat(base): add ssh-from-control management-plane source (ADR-021)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 17:43:55 +02:00
46d091e82e docs(access): add ACCESS.md service record template
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:36:28 +02:00
f8098c2e15 docs(access): reconcile ADR-016/020 with control-node SSH source (ADR-021)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-09 17:34:57 +02:00
0fe9e45f57 docs(access): add ADR-021 operational-access doctrine 2026-06-09 17:33:46 +02:00
cdbd66410a docs(access): implementation plan for ADR-021 operational access
Splits the work into Tranche A (land now: ADR-021, ADR-016/020
reconciliation, ssh-from-control firewall source, ACCESS.md template,
/check-access command, governance + index wiring) and Tranche B
(build-pending on infra: per-service access__* + rendered ACCESS.md,
/check-access running).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 17:16:49 +02:00
fd4bbbc977 docs(access): design operational-access doctrine (ADR-021)
Brainstorming spec for ADR-021: operational access as a deployment
deliverable. Two layers (host baseline + per-service), a three-tier
access ladder (mesh SSH -> LAN SSH from ubongo -> console break-glass),
declarative access__* data rendering ACCESS.md and driving a
/check-access verifier. Resolves TODO 3.2 (API access) and 7.2 (host
access); amends ADR-016 (SSH also from ubongo) and ADR-020
(ssh-from-control source).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 17:10:54 +02:00
fcfb056591 docs(friction): record host-nftables build gotchas (iif/iifname, molecule ansible_host, venv PATH, apply-path coverage)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 19:16:21 +02:00
402913efb3 fix(base): make rollback snapshot restorable (flush-prefixed)
Bare 'nft list ruleset' has no leading flush, so the timer's 'nft -f rollback'
was a no-op on first apply (empty file) and errored ('table exists') on later
applies — the auto-rollback silently did nothing, defeating the askari lockout
safeguard. Prepend 'flush ruleset' so the revert is atomic + self-contained.
Verified the snapshot->lockout->revert round-trip in an isolated netns.
Also fix stale STATUS prose (base is partially built, not absent).
2026-06-06 19:15:38 +02:00
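A minimal sketch of the snapshot fix this commit describes, assuming the snapshot is built as a string before being written to the rollback file. The helper name `restorable_snapshot` is hypothetical; only the `flush ruleset` prefix comes from the commit message.

```python
def restorable_snapshot(listed_ruleset: str) -> str:
    """Prefix an 'nft list ruleset' dump with 'flush ruleset'.

    Without the prefix, loading the file merges into the live ruleset (or
    errors on existing tables); with it, loading replaces the ruleset.
    """
    body = listed_ruleset.strip()
    if body.startswith("flush ruleset"):
        # Already self-contained: just normalise the trailing newline.
        return body + "\n"
    return "flush ruleset\n" + body + ("\n" if body else "")
```

On a first apply the dump is empty, so the snapshot reduces to the single line `flush ruleset`, and the revert then restores an empty ruleset instead of doing nothing.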
90683c7912 docs: record base firewall concern built (ADR-020 host layer)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 19:10:27 +02:00
6fb104e934 test(base): molecule verify asserts rendered firewall rules + nft -c
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 19:07:24 +02:00
b006196cc5 fix(base): confirm firewall apply over a FRESH connection
established/related keeps the in-flight session alive across the swap, so the
prior 'next task runs' confirm always passed even if new connections were
bricked — the rollback was theater. reset_connection + wait_for_connection now
force a fresh handshake through the new ruleset; failure aborts the play and the
armed timer reverts. (meta: reset_connection ignores 'when' — benign extra
reconnect on no-op runs; verified idempotent in molecule.)
2026-06-06 19:06:39 +02:00
026a29f609 feat(base): safe nftables apply with systemd-run auto-rollback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 19:03:58 +02:00
bca74458fb fix(base): iifname for load-time safety; zone-source molecule fixture
nft -c rejects iif "wt0" when the interface is absent (container, or any host
before NetBird); iifname matches by name and is robust to wt0 coming/going.
Drop the ansible_host fixture override (the docker connection uses it as the
container name) — molecule covers zone resolution, pytest covers service->IP.
2026-06-06 19:02:50 +02:00
eeab5ed8de feat(base): render nftables ruleset from catalog (+ molecule fixture)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 18:57:44 +02:00
7dae93e4e1 fix(base): firewall resolver fails fast on empty/malformed sources; cover hosts: + proto default
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 18:56:04 +02:00
4127f8bc6b feat(base): firewall catalog resolver filter plugin + tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 18:51:10 +02:00
390cd3b335 feat(base): shared firewall catalog/zones + firewall defaults
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 18:49:40 +02:00
2486e31f7d feat(base): scaffold role + meta/README (firewall concern incoming)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 18:48:35 +02:00
03329d7d25 docs(plan): host nftables firewall implementation plan
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 18:47:48 +02:00
d7fbaca554 docs(spec): host nftables firewall design (ADR-020 build #1)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 18:40:50 +02:00
2ad50e4d5b docs(capabilities): note two-layer firewall model (ADR-020) 2026-06-06 16:00:19 +02:00
a9287427e3 docs(todo): mark 3.5 firewall strategy decided (ADR-020) 2026-06-06 16:00:01 +02:00
e24aab28b2 docs: link ADR-020; harden firewall guardrail to the service catalog 2026-06-06 15:59:47 +02:00
d311f67098 docs(adr): ADR-020 firewall strategy (two-layer + shared catalog) 2026-06-06 15:59:30 +02:00
8d1d8a88ea docs(friction): escalate execution-mode prompt; no plan→impl approval gate
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 15:57:40 +02:00
f700f4a475 docs(plan): firewall strategy ADR-020 landing plan
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 15:42:17 +02:00
2a65391c0e docs(spec): firewall strategy design (TODO 3.5 → ADR-020)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 15:36:24 +02:00
86bb3559ad STATUS: record tag standard + enforcement (ADR-019)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 15:23:58 +02:00
fac438cc92 fix(tags): recognize name: role key; only check roles: in plays
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 15:20:09 +02:00
5aeeb094eb feat(tags): enforce role imports carry their role-name tag
Adds role_tag_problems() to check-tags.py: every role imported in a
play's roles: block must carry its own role name as a tag (extra tags
allowed; templated role names skipped). Wires the check into main() so
make lint catches violations. 6 new unit tests (29 total, all passing).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 15:12:48 +02:00
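The rule enforced here can be illustrated with a small sketch that works on an already-parsed play (a plain dict). It is not the `role_tag_problems()` in `check-tags.py`; the function name, the message wording, and the treatment of bare-string role entries are assumptions. The `name:` key is accepted alongside `role:` because a follow-up fix in this range mentions it.

```python
def role_tag_problems_sketch(play: dict) -> list:
    """Flag roles in a play's roles: block that lack their own name as a tag."""
    problems = []
    for entry in play.get("roles") or []:
        if isinstance(entry, str):
            # Bare "- rolename" entries carry no tags at all.
            name, tags = entry, []
        else:
            name = entry.get("role") or entry.get("name")
            tags = entry.get("tags") or []
        if isinstance(tags, str):
            tags = [tags]  # "tags: base" is shorthand for a one-item list
        if not name or "{{" in name:
            continue  # templated (or unnamed) roles are skipped
        if name not in tags:
            problems.append(f"role '{name}' is missing its own tag '{name}'")
    return problems
```

Extra tags are allowed because the check only asks whether the role name is present in the tag list, not whether it is the only tag.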
2e5a1e1e23 fix(tags): exclude molecule scenarios from tag scan; clarify ADR enforcement
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 09:50:14 +02:00
24b5e9361e docs(tags): ADR-019 + CLAUDE.md/TODO/CAPABILITIES (tagging standard)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 09:42:22 +02:00
9584cc2c76 feat(tags): Proxmox VM metadata convention (managed-by=terraform)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 09:39:19 +02:00
0b59107b33 feat(tags): enforce tag vocabulary in make lint; fix docker_host tag
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 09:37:43 +02:00
a3ea2aceb2 feat(tags): scan roles/+playbooks/ and fail on unknown tags
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 09:33:12 +02:00
b45118dac3 feat(tags): checker helpers — tag collection & allowed-set
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 09:28:03 +02:00
24397fa280 feat(tags): add allowed-tag vocabulary (tests/tags.yml) 2026-06-06 09:26:20 +02:00
04bfc26422 docs(plan): tagging standard implementation plan (ADR-019)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 09:21:15 +02:00
4ed9e9a8bf docs(spec): tagging standard design (TODO 3.7/3.11 → ADR-019)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 09:15:44 +02:00
9bdb3017bb CLAUDE.md: link ADR-018 (logging) 2026-06-06 07:07:43 +02:00
12baeba750 TODO: mark log management decided (ADR-018); reconcile 3.6 2026-06-06 07:07:01 +02:00
1021c6d25d STATUS: record logging pipeline + security alerting (ADR-018) 2026-06-06 07:06:06 +02:00
c6aa45037d ADR-012: track log-storage allocation + SSD wearout (ADR-018) 2026-06-06 07:05:15 +02:00
687d623a52 CAPABILITIES: Loki decided + Alloy agent + security alerting (ADR-018) 2026-06-06 07:04:26 +02:00
6f68f8b8c5 accepted-risks: add R4 (no cryptographic WORM for logs) 2026-06-06 07:03:27 +02:00
30c6a93c28 ADR-002: make central-logging + alerting controls concrete (ADR-018) 2026-06-06 07:02:32 +02:00
2894319f01 Add ADR-018 (logging and log integrity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-06 07:01:36 +02:00
96f8f20c05 Add implementation plan for logging + log integrity (ADR-018)
Task-by-task docs plan: author ADR-018 and reconcile ADR-002, accepted-risks
(R4), CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md. Roles/pipeline deferred
on the base + service-role machinery.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 06:59:58 +02:00
8eb5ccf97d Add design spec for logging + log integrity (ship all to Loki)
All logs -> on-cluster Loki for troubleshooting/trends; a security-relevant
subset also ships write-only off-site to askari (append-only, tamper-resistant
against full-cluster compromise); skip WORM (accepted-risk R4). Alloy agent in
base; loki/grafana service roles; disk-wear handled as a design parameter.
Basis for ADR-018.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 22:03:31 +02:00
568729e7bd repo-scan: cut broken-path-ref + marker false positives
- broken-path-ref: skip template/generated-report paths — a placeholder
  (<service>) immediately following the match, a YYYY-MM-DD date token, or a
  path under a generated-report reviews/ dir (14 -> 0 on the current tree).
- marker: skip numbered-backlog references (TODO 8.2, TODO-3.1, TODO (2.2,
  TODO item 16) which point at the backlog, not code markers (35 -> 2; the
  remaining two are literal "TODO:" strings in a plan doc). Real code markers
  (TODO:, FIXME, etc.) still caught — verified with a synthetic fixture.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 20:37:40 +02:00
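The marker filter in this commit can be sketched as two regular expressions: one for real code markers and one for numbered-backlog references that should be ignored. The patterns and the name `is_code_marker` are assumptions for illustration, and the marker list is limited to the two markers the commit names explicitly (`TODO`, `FIXME`); the real `repo-scan.py` may differ.

```python
import re

# Real code markers. The commit says "TODO:, FIXME, etc."; only the two named
# markers are modelled here.
MARKER = re.compile(r"\b(TODO|FIXME)\b")

# Backlog references named in the commit message:
# "TODO 8.2", "TODO-3.1", "TODO (2.2", "TODO item 16".
BACKLOG_REF = re.compile(r"\bTODO(?:[ -]\(?|\s+item\s+)\d+(?:\.\d+)?")


def is_code_marker(line: str) -> bool:
    """True if the line still contains a marker once backlog refs are removed."""
    without_backlog_refs = BACKLOG_REF.sub("", line)
    return MARKER.search(without_backlog_refs) is not None
```

Removing the backlog references first, then searching, keeps a line such as `TODO: refactor (see TODO 8.2)` flagged: the backlog pointer is dropped but the literal marker remains.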
db76be2a63 review-repo: clear O7-O12 clarity items
- ADR-011: ruled-out row was "digest-pinning stateful" (contradicted Decision 2);
  now "digest-only (no readable tag)" — tag@digest is adopted (O7)
- ADR-003/010: act_runner names ubongo as the runner host, runner VM as a future
  option (O8)
- ADR-008: WireGuard Molecule-exclusion row reframed to NetBird wt0 data plane (O9)
- ADR-011: scheduled_jobs xref points to TODO 8.3, not ADR-010 (O10)
- CAPABILITIES: add /verify-service Level 4 capability row (O11)
- TODO 3.10: rewrite the garbled base-container question (O12)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 19:28:07 +02:00
8e4bf3dd88 ADR-006/014: clear two stale labels
Review O5/O6: ADR-006 mislabeled backend.tf as "Forgejo state backend" (its own
State-backend section chooses local state — Forgejo's API is read-only); ADR-014
called plugin reproducibility open though TODO 10.7 is done.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:55:17 +02:00
d8afa94c4b Name and propagate the offsite_hosts inventory group (askari)
Review O4: ADR-016 said askari gets "its own inventory group" but never named it.
Settled as offsite_hosts (off-site, distinct from on-site-but-off-cluster ubongo).
Added to VALID_GROUPS (tf_to_inventory.py), ADR-009 valid groups, ADR-001/ADR-016
host-group enumerations, and CLAUDE.md. Generated hosts.yml picks up the section on
the next make tf-inventory (a manual-exception group like control).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:54:54 +02:00
f0d189ca09 Thread the VERIFY.md convention through ADR-004/new-role/README
Review O1-O3: ADR-017's per-service VERIFY.md requirement now appears in the
ADR-004 service-role file table, as a new-role runbook step, and the README
docs index/tree are refreshed (ADRs 010-017, security/testing/hardware dirs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:52:42 +02:00
3dd03d4198 review-repo: 2026-06-05 report (4 auto-fixed, 12 open)
Stale-deferred check exercised: 6 open-deferred-items all confirmed genuinely
open, 0 stale-deferred. Top open: thread ADR-017 VERIFY.md convention through
ADR-004/new-role/README; name the askari inventory group (ADR-016).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:24:39 +02:00
666ad42634 review-repo: fix DNS-write contradictions + stale control-node/template refs
Auto-fixes from /review-repo:
- ADR-005 + new-host.md: drop "Terraform writes the host's DNS A record"
  (contradicts ADR-009 — dns role owns the zone; recurs from the 2026-05-30 run)
- ADR-005: control node is physical ubongo, not cloned from the template (ADR-015)
- CLAUDE.md: add the VERIFY.md template to Further reading
- TODO.md: typo fixes (we we / seperate)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:23:16 +02:00
f566fd17eb review-repo: add stale-deferred check for ADR Deferred entries
repo-scan.py now enumerates open ADR "Deferred/Open" items and flags any that
another file describes as resolved but which isn't marked resolved in place
(the recurring miss in docs/FRICTION.md). review-repo.md's Phase 2 reviewer
confirms each open item against later ADRs/STATUS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:13:49 +02:00
66d11cc352 FRICTION: stale-deferred-item pattern recurred a 3rd time — build the check
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:06:26 +02:00
d5c62c99ad STATUS/ADR-015: mark the three deferred design threads resolved
ubongo, the NetBird mesh, and Level 4 verification are design-resolved
(ADR-015/016/017 + specs + plans); STATUS now says so while keeping build
status honest. Also resolves ADR-015 deferred #2 (browser harness), which
was left open when ADR-017 landed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 18:01:14 +02:00
91d851fe4d TODO: mark headless-browsing + test-user standard decided (ADR-017) 2026-06-05 13:20:40 +02:00
01e4f96983 STATUS: record Level 4 service-UI verification (ADR-017) 2026-06-05 13:19:53 +02:00
eb415db96e Git-ignore verify screenshots; add testing/reviews dir
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 13:19:04 +02:00
920e47b50d CLAUDE.md: VERIFY.md role convention; link ADR-017
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 13:18:07 +02:00
22c0747c0b service-checklist: add Level 4 UI verification to the gate 2026-06-05 13:17:16 +02:00
25f04002df Add /verify-service skill for Level 4 UI verification (ADR-017) 2026-06-05 13:16:25 +02:00
05abb3b6a5 Add VERIFY.md template for service-UI acceptance (ADR-017) 2026-06-05 13:15:13 +02:00
2df1f98153 ADR-008: expand Level 4 into the verify-service harness (ADR-017) 2026-06-05 13:14:12 +02:00
cc3337502f Add ADR-017 (service-UI acceptance verification, Level 4) 2026-06-05 13:13:09 +02:00
be6a064f44 Add implementation plan for service-UI verification (Level 4)
Task-by-task: author ADR-017, expand ADR-008 Level 4, create the VERIFY.md
template + /verify-service skill, and reconcile the checklist/CLAUDE.md/
gitignore/STATUS/TODO. Buildable-now artifacts; live run stays deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 13:11:43 +02:00
2bd11b5aa9 Add design spec for service-UI verification (ADR-008 Level 4)
Resolves ADR-015 deferred item #2 + TODO 2.2/2.3: a Claude-driven exploratory
browser harness (/verify-service) that exercises staging service UIs through
real SSO, backed by a per-service VERIFY.md, with test users in staging
Authentik and a manual-test handoff. Basis for ADR-017.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 13:05:11 +02:00
5322cce5c6 FRICTION: resolving a deferred decision needs a doc-wide grep sweep
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 12:20:20 +02:00
cd62c5e098 new-host runbook: mesh VPN resolved to NetBird (ADR-016) 2026-06-05 11:52:22 +02:00
ed9fdcc10a CLAUDE.md: link ADR-016 (mesh VPN) 2026-06-05 11:51:36 +02:00
787aa3b8e1 STATUS: record NetBird mesh (coordinator + base enrollment) 2026-06-05 11:50:53 +02:00
841f666de9 CAPABILITIES: VPN decided — NetBird self-hosted (ADR-016) 2026-06-05 11:50:04 +02:00
08165ffb68 accepted-risks: R3 now the concrete NetBird coordinator risk 2026-06-05 11:48:58 +02:00
2ae5cf4535 ADR-015: resolve mesh-VPN deferral — NetBird on askari (ADR-016) 2026-06-05 11:48:04 +02:00
5a32dd46d3 ADR-007: retire VLAN-99 WireGuard for the NetBird mesh (ADR-016) 2026-06-05 11:47:03 +02:00
ff796c64ca Add ADR-016 (mesh VPN — NetBird self-hosted on askari) 2026-06-05 11:45:45 +02:00
4b85b14f1f Add implementation plan for NetBird mesh VPN
Task-by-task docs plan: author ADR-016 and reconcile ADR-007 (retire VLAN-99
WireGuard), ADR-015 (resolve deferred #1), accepted-risks R3, CAPABILITIES,
STATUS, CLAUDE.md. Documentation-only; role/deployment waits on the base role.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:44:05 +02:00
99ace3eb48 Add design spec for mesh VPN (NetBird self-hosted on askari)
Resolves ADR-015 deferred item #1: the mesh VPN is NetBird, self-hosted on
askari, replacing ADR-007's VLAN-99 OPNsense WireGuard. Agent-per-host
enrollment via base, embedded local-user IdP, coordinator off-site for
outage survival. Basis for ADR-016.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 10:58:35 +02:00
a53941dffe CLAUDE.md: fix capabilities doc link after rename to CAPABILITIES.md 2026-06-05 09:50:28 +02:00
7a48a60f14 CLAUDE.md: fix project summary — control node is physical ubongo 2026-06-05 09:49:23 +02:00
a30c1af3f0 CLAUDE.md: link ADR-015; note ubongo as physical control node 2026-06-05 09:48:09 +02:00
9653a34241 STATUS: record ubongo control host as designed, not built 2026-06-05 09:47:24 +02:00
55a3666d16 accepted-risks: reserve R3 mesh-VPN coordinator (pending choice) 2026-06-05 09:46:40 +02:00
a2db8058e7 rotate-secrets: document offline vault break-glass for ubongo 2026-06-05 09:45:27 +02:00
b89ca8835a new-host runbook: control node ubongo is bare-metal
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 09:44:31 +02:00
3fb780c286 ADR-012/hardware: add ubongo as physical control node
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 09:43:09 +02:00
66064be7b2 ADR-008: tests run on ubongo; stub Level 4 service-UI acceptance 2026-06-05 09:42:01 +02:00
07bc1c83f0 ADR-009: control-node exception is a physical box, not a VM 2026-06-05 09:41:03 +02:00
1064716d49 ADR-005: control node bootstrap is bare-metal Debian on ubongo
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-05 09:40:15 +02:00
15779be086 ADR-001: control node is physical ubongo outside cluster 2026-06-05 09:39:18 +02:00
5aca796fa0 Add ADR-015 (control/AI-worker host ubongo) 2026-06-05 09:37:56 +02:00
288 changed files with 26579 additions and 379 deletions


@@ -6,6 +6,7 @@ exclude_paths:
- .venv/
- .collections/
- .scaffold/
- tests/integration/.run/ # transient harness run dir (gitignored, generated)
- "**/vault.yml" # ansible-vault encrypted — not lintable YAML
# Warn only (don't fail) on these rules during initial setup


@@ -0,0 +1,49 @@
Operational-access verification (ADR-021)
Probe every documented way in to a service or host from `ubongo` and report which paths
are live. Reads the target's `access__*` data (and host baseline), so the verifier and
`ACCESS.md` can never disagree. Argument: a service/role name or a host
(e.g. `/check-access photoprism`, `/check-access docker01`).
## Prerequisites (this is forward-looking — ADR-021 dependencies)
This skill cannot run until these exist; if any is missing, say so and stop — do not
improvise around it:
- `ubongo` reachable on the mesh **and** the LAN (it runs the probes).
- The target host/service is deployed (staging or production inventory).
- `roles/<name>/` carries `access__*` data (services) / the host baseline applies.
- Vault unlocked (`rbw unlocked`) for any token-authenticated API probe.
## Process
### Phase 0 — resolve the target
Resolve the argument to a host or a service role + its host. Load the `access__*` data
(service) or the host-baseline + break-glass record (host). State what you will probe.
### Phase 1 — probe each declared path
| Path | Probe | Green = |
|---|---|---|
| `wt0` mesh SSH | connect over the mesh, run `true` | reachable + key works |
| LAN SSH from `ubongo` | connect via the LAN address, run `true` | reachable + key works |
| exec + compose | `docker compose -p <project> ps`; exec `true` in each `access__containers` entry | stack up, exec works |
| logs | query Loki for `access__log.loki_labels`, expect recent lines | logs flowing |
| admin API | `curl` `access__api.health_path` with the token from `access__api.auth.vault_ref` | 2xx |
| break-glass | reachability of the Proxmox/provider console endpoint **only** | console host reachable |
Break-glass is **never exercised** — firing a serial console is invasive; confirm the
fallback exists, do not drive it.
### Phase 2 — report
Emit a pass/fail table. For any red path, name it and the likely cause (e.g. "API token
in vault stale", "Alloy not shipping", "`base__firewall_control_addr` unset → no
`ssh-from-control` rule"). Verdict line: e.g. "3/4 paths green; admin API red".
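The verdict line can be derived mechanically from the per-path results. The sketch below assumes the probe results are held in a dict mapping a path name to a pass/fail boolean; the function name `verdict` and that data shape are illustrative, not part of the command.

```python
def verdict(results: dict) -> str:
    """Summarise probe results as 'N/M paths green; <red paths> red'."""
    green = sum(1 for ok in results.values() if ok)
    red = [name for name, ok in results.items() if not ok]
    summary = f"{green}/{len(results)} paths green"
    if red:
        summary += "; " + ", ".join(red) + " red"
    return summary
```

For example, three passing paths plus a failing `admin API` entry yield the example verdict quoted in the text.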
## Notes
- Read-only and non-destructive — probes confirm reachability, they do not change state.
- This is the access analogue of `/verify-service` (ADR-017): designed now, runs when the
control node + hosts exist.


@@ -0,0 +1,29 @@
---
description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
---
Verify that a service's **declared** backup data (`backup__*`) is actually captured in
the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).
**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
command reports `not-yet-available` rather than failing.
## Preconditions
- `roles/<name>/` carries `backup__*` data (or `backup__state: false` with a reason).
- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
`not-yet-available` and stop.
## Checks (when live)
Load the `backup__*` data for the resolved role, then:
| Check | How | Green when |
|---|---|---|
| snapshot freshness | `restic snapshots --tag <backup__service> --latest 1` | a snapshot ≤ ~24 h old exists |
| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
| integrity | `restic check --read-data-subset=<subset>` (sampled; the flag requires a subset value) | no errors |
Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
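As an illustration of the snapshot-freshness row, the sketch below judges a single snapshot timestamp against the ~24 h window. It assumes the latest snapshot time has already been obtained and parsed into a timezone-aware `datetime`; the function name `freshness_result`, its parameters, and the result strings other than `n/a (stateless)` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_result(
    backup_state: bool,
    latest_snapshot: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> str:
    """Return the freshness verdict for one service."""
    if not backup_state:
        # Stateless roles declare backup__state: false and are not checked.
        return "n/a (stateless)"
    if latest_snapshot is None:
        return "fail"  # no snapshot carries the service tag
    return "pass" if now - latest_snapshot <= max_age else "fail"
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check deterministic and easy to test.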


@@ -0,0 +1,63 @@
# Kaizen — curate the friction log into improvements
Consume the **Open signals** in `docs/FRICTION.md`: decide a verdict for each, migrate
durable knowledge into the right docs, and archive consumed signals into the decisions
ledger. **Curate-only** — do not hunt for new signals; capture stays manual. This is an
interactive, judgment-dense pass: propose, the operator decides, you apply on approval.
Design: `docs/superpowers/specs/2026-06-14-kaizen-command-design.md`.
## Phase 0 — scan
Run `python3 scripts/friction-scan.py > /tmp/kaizen.json`. It returns each Open signal as
`{tag, first_seen, age_days, recurrence_count, referenced_paths, still_exists, text}`.
Treat `still_exists: false` as a hint the signal may already be resolved.
## Phase 1 — triage
Order signals by `recurrence_count` desc, then `age_days` desc, then tag. **Group signals
that share a root cause** and curate them together. Present the agenda before editing
anything: total open, how many recurring (≥3), how many look already-resolved.
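The ordering and agenda above could be sketched as follows. It assumes the scan output has been loaded as a list of dicts with the field names listed in Phase 0; the helper names `triage_order` and `agenda` and the sample tags are hypothetical.

```python
def triage_order(signals):
    """Sort by recurrence_count desc, then age_days desc, then tag ascending."""
    return sorted(
        signals,
        key=lambda s: (-s["recurrence_count"], -s["age_days"], s["tag"]),
    )


def agenda(signals, recurring_threshold=3):
    """Summarise what is presented before any edit is made."""
    return {
        "total_open": len(signals),
        "recurring": sum(
            1 for s in signals if s["recurrence_count"] >= recurring_threshold
        ),
        "maybe_resolved": sum(1 for s in signals if not s["still_exists"]),
    }


# Made-up sample signals for illustration.
signals = [
    {"tag": "vault-lock", "age_days": 10, "recurrence_count": 1, "still_exists": True},
    {"tag": "menu-reask", "age_days": 4, "recurrence_count": 3, "still_exists": True},
    {"tag": "adr-drift", "age_days": 10, "recurrence_count": 1, "still_exists": False},
]
print([s["tag"] for s in triage_order(signals)])
print(agenda(signals))
```

Grouping by shared root cause is left out on purpose: it is a judgment call, not something a sort key can decide.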
## Phase 2 — per-signal curation (interactive)
For each signal/group, present: a one-line restatement, the evidence (age, recurrence,
still-real), and a proposed **verdict**. Verdicts:
- **SYSTEMATIZE** — migrate the durable lesson into its right home (a runbook, an ADR,
`CLAUDE.md`, a new `scripts/repo-scan.py` check, or a hook).
- **CHANGE** — adjust an existing tool/convention/config rather than document it.
- **PARK** — *out-of-phase but not obsolete*. Remove from the active tree, but write a
ledger row recording **where it now lives (git SHA/branch/doc) and a resurrection
trigger**. The default for "not touched lately but not wrong."
- **REMOVE** — *obsolete*: superseded, wrong, never worked, duplicated. Ledger row states
why.
- **ALREADY-BUILT** — the systematization already exists / the fix landed; archive.
- **ACCEPTED** — conscious no-op (revisit-if-recurs); archive.
- **KEEP-OPEN** — still accruing, not ripe; leave it in *Open signals* (no ledger row).
Rules:
- **Knowledge is never removed** — SYSTEMATIZE/migrate it; only *active surface* (scripts,
checks, conventions, plugins) is parked/removed.
- Every reductive verdict must classify *why unused*: **obsolete → REMOVE**,
**out-of-phase → PARK**.
- The operator approves / modifies / rejects each verdict. On approval: do the mechanical
edit (migrate text into the target doc; **move the signal from *Open signals* into the
ledger table**; delete the parked/removed file) and show the diff.
- PARK and REMOVE both delete from the active tree — the difference is the ledger row.
Git history + the ledger row are the park mechanism; never create a `parked/` directory.
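A small sketch of the reductive-verdict rule and of a ledger row, assuming the three-column ledger shape named in the close-out phase. The function names, the cell formatting, and the sample row are hypothetical; the real ledger in `docs/FRICTION.md` may format cells differently.

```python
def reductive_verdict(why_unused):
    """Map the 'why unused' classification to its verdict."""
    mapping = {"obsolete": "REMOVE", "out-of-phase": "PARK"}
    if why_unused not in mapping:
        raise ValueError(f"classify as obsolete or out-of-phase, got {why_unused!r}")
    return mapping[why_unused]


def ledger_row(signal, first_seen, verdict, resolution):
    """Render one row: Signal (first seen) | Verdict | Resolution."""
    if verdict == "KEEP-OPEN":
        # KEEP-OPEN signals stay in Open signals and get no ledger row.
        raise ValueError("KEEP-OPEN signals are not archived")
    if not resolution.strip():
        raise ValueError(f"{verdict} needs a resolution note")
    return f"| {signal} ({first_seen}) | {verdict} | {resolution} |"


print(reductive_verdict("out-of-phase"))
# Made-up sample row for illustration.
print(ledger_row("unused scan check", "2026-06-10", "PARK",
                 "git SHA <sha>; resurrect when the backup node exists"))
```

For a PARK row the resolution text is where the location and the resurrection trigger would be recorded; for REMOVE it carries the reason.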
## Phase 3 — close-out
- Add a new dated block under `## Kaizen reviews — decisions ledger` (newest first), same
shape as the existing block: a table with columns **Signal (first seen) | Verdict |
Resolution / where it lives now**.
- **Bias-to-remove discipline check:** if every verdict this pass was SYSTEMATIZE/CHANGE
(only accreting), say so explicitly.
- **Self-eval (light):** is `/kaizen` being run often enough (oldest consumed age)? Should
the nudge thresholds in `scripts/friction-scan.py` change? Note it.
- Run `make lint` if any code/docs changed; revert anything that breaks it.
- Commit per `CLAUDE.md` git conventions (one logical unit — straight to `main` if
small/safe, a branch if sweeping; show the diff first for a branch).
- Print a one-line summary: `consumed X · parked Y · removed Z · kept-open W · migrated → <docs>`.
## Headless / cron (future)
Deferred until the notify + cron stack exists (`docs/TODO.md` 11.3). When run
non-interactively, **report only**: print the proposed verdicts and the nudge, do not edit
or commit.

@ -25,7 +25,17 @@ report the rest, and write a tracked report to `docs/reviews/`.
### Phase 0 — deterministic pre-scan
Run `python3 scripts/repo-scan.py > /tmp/repo-scan.json`. It returns the **inventory**
(roles, ADRs, runbooks, playbooks, scripts — your shard list) and **exact findings**
(markers, broken refs, unencrypted vaults). Fold these into the report verbatim.
(markers, broken refs, unencrypted vaults, ADR-structure violations). Fold these into
the report verbatim.
It also emits two deferral checks (see Phase 2): `open-deferred-item` (every still-open
ADR "Deferred/Open" entry — a checklist to confirm) and `stale-deferred` (an entry
another file describes as resolved but which isn't marked resolved in place —
high-confidence, usually auto-fixable by marking the source ADR's entry RESOLVED).
Also run `python3 scripts/friction-scan.py --nudge` and include its one-line output in the
report's summary — it flags when the kaizen loop (`/kaizen`) is overdue (recurring signals,
backlog size, or age). This is a reminder only; do not act on `FRICTION.md` from here.
### Phase 1 — fan-out judgement review
Scale to repo size:
@ -42,6 +52,13 @@ location (file:line), description, suggested_fix, auto_fixable (bool)}`.
- Merge and dedupe all findings (deterministic + reviewer).
- Run **one cross-cutting reviewer** over the full ADR set + `STATUS.md` + `CLAUDE.md`
to catch contradictions that span files (per-shard agents can't see these).
- **Resolve the deferral checklist.** For every `open-deferred-item` from Phase 0,
judge whether it is *genuinely* still open: search later ADRs / `STATUS.md` for a
decision on that subject (a deferred item often resolves silently when a later ADR
lands). If it has been decided, it is a stale-deferred finding — the fix is to mark
that entry RESOLVED in its **source ADR's** Deferred list (the spot the resolving
ADR's own change won't have touched). Treat every `stale-deferred` finding as
high-confidence. This is the recurring miss logged in `docs/FRICTION.md`.
- Diff against the previous run's `docs/reviews/<prev>-findings.json` and tag each
finding **new / recurring / resolved**.
- Prioritise by severity; split into auto-fixable vs report-only.
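The new / recurring / resolved tagging could look roughly like this. It is a sketch under one loud assumption: two findings are treated as the same when `location` and `description` match, which is not stated in the source and would misfire when line numbers shift between runs. `finding_key`, `tag_findings`, and the sample findings are hypothetical.

```python
def finding_key(finding):
    """Assumed identity: same location and description means same finding."""
    return (finding["location"], finding["description"])


def tag_findings(previous, current):
    """Tag current findings new/recurring and append resolved ones."""
    prev_keys = {finding_key(f) for f in previous}
    curr_keys = {finding_key(f) for f in current}
    tagged = [
        {**f, "status": "recurring" if finding_key(f) in prev_keys else "new"}
        for f in current
    ]
    tagged += [
        {**f, "status": "resolved"}
        for f in previous
        if finding_key(f) not in curr_keys
    ]
    return tagged


# Made-up sample findings for illustration.
previous = [
    {"location": "docs/decisions/016-mesh-vpn.md:40", "description": "stale-deferred entry"},
    {"location": "roles/base/README.md:1", "description": "empty README"},
]
current = [
    {"location": "docs/decisions/016-mesh-vpn.md:40", "description": "stale-deferred entry"},
    {"location": "CLAUDE.md:12", "description": "broken ref"},
]
for f in tag_findings(previous, current):
    print(f["status"], f["location"])
```

A key that ignores the line number (file plus description) may be more stable across runs, at the cost of merging distinct findings in the same file.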

@ -0,0 +1,65 @@
# Exploratory service-UI verification (ADR-008 Level 4 / ADR-017)
Drive a browser against a **staging** deploy of a service, exercise its
`roles/<service>/VERIFY.md` acceptance journeys plus free exploration, and write a
tracked report. Argument: the service/role name (e.g. `/verify-service photoprism`).
## Prerequisites (this is forward-looking — ADR-017 dependencies)
This skill cannot run until all of these exist; if any is missing, say so and stop —
do not improvise around it:
- `ubongo` with the `playwright` Claude Code plugin (browser automation tools).
- A **staging** deploy of the target service (ADR-008 Level 2).
- Authentik (staging) for test-user provisioning + SSO.
- `roles/<name>/VERIFY.md` present.
## Process
### Phase 0 — safety gate (staging only)
Confirm the target resolves to the **staging** environment/inventory, never production.
If you cannot prove it is staging, **stop** — exploratory clicking is destructive
(ADR-002). State why you stopped.
### Phase 1 — read intent
Read `roles/<name>/VERIFY.md`: the Critical user journeys, What good looks like, Not
browser-verifiable, and Test data sections.
### Phase 2 — test user
Provision (reuse-or-create) a test user in the staging Authentik `test` group, with
ephemeral credentials held only for this run. Never use a real/production account.
### Phase 3 — drive the browser
Via the `playwright` plugin, on `ubongo`: open the service's staging URL (resolved via
boma DNS), authenticate through the real Traefik + Authentik SSO flow, then execute each
`VERIFY.md` journey — judging pass/fail and screenshotting key states — and free-explore
for anything obviously broken. Save screenshots to the git-ignored `.verify-runs/`
working dir; avoid capturing credential screens.
### Phase 4 — write the report
Save to `docs/testing/reviews/YYYY-MM-DD-<name>.md` and overwrite
`docs/testing/reviews/latest.md`. Structure:
- **One-line verdict** — e.g. "5/5 journeys passed; one manual check pending".
- **Run metadata** — date, service, staging env, test user, reviewed commit SHA.
- **Per-journey result** — pass/fail against `VERIFY.md`, with the evidence (linked
screenshot path) and any observation.
- **Free-exploration findings** — anything noticed beyond the listed journeys.
- **Manual-test checklist** — the "Not browser-verifiable" items plus anything Claude
couldn't do: numbered steps, expected result, and why it was handed off.
### Phase 5 — clean up + commit
Offer to clean up the `test`-group user (or note that the staging rebuild will).
Commit the report markdown per CLAUDE.md git conventions. **Do not** commit
`.verify-runs/` (git-ignored).
## Notes
- Reports (markdown) are committed; screenshots stay local on `ubongo` in `.verify-runs/`.
- Exploratory and interactive — this is not a deterministic CI gate.

@ -0,0 +1,70 @@
#!/usr/bin/env bash
#
# Stop guard for two external-skill gates that conflict with boma conventions, where
# prose reminders repeatedly failed to hold (docs/FRICTION.md):
#
# 1. The execution-mode menu — writing-plans / subagent-driven-development script a
# "Subagent-Driven vs Inline Execution — which approach?" menu at the plan→execution
# handoff. boma's standing preference is to NEVER present it and proceed
# subagent-driven. (Recorded by the 2026-06-10 kaizen review; the 2026-06-17 review
# widened the matcher to also catch free-form *prose* re-asks of the same choice —
# e.g. "which execution approach?" — which the literal-menu matcher missed. The
# sibling push-vs-not re-ask is deliberately NOT hooked: a genuine "should I push?"
# is sometimes legitimate, so it stays a soft default via the
# dont-reask-settled-defaults memory rather than a hard block.)
# 2. The brainstorming spec-review gate — the brainstorming skill scripts "Spec written
# and committed … please review it before … the implementation plan." The standing
# agreement is to move directly from the committed spec to writing-plans. (Recorded
# by the 2026-06-14 kaizen review; 06-10/06-14 recurrences.)
#
# Fails OPEN: any parse/read problem → allow the stop. Respects stop_hook_active so a
# block can never loop. Match signatures are deliberately tight so they fire on the
# actual gate text, not on meta-discussion of it.
#
set -uo pipefail
input=$(cat 2>/dev/null) || exit 0
# Loop guard: if we already blocked once for this stop, let it through.
active=$(printf '%s' "$input" | jq -r '.stop_hook_active // false' 2>/dev/null) || exit 0
[ "$active" = "true" ] && exit 0
transcript=$(printf '%s' "$input" | jq -r '.transcript_path // empty' 2>/dev/null) || exit 0
[ -z "$transcript" ] || [ ! -r "$transcript" ] && exit 0
# Last assistant message's text blocks, joined.
text=$(jq -rs '
([ .[] | select(.type=="assistant") ] | last) as $a
| ($a.message.content // [])
| if type=="array" then [ .[] | select(.type=="text") | .text ] | join("\n")
elif type=="string" then .
else "" end
' "$transcript" 2>/dev/null) || exit 0
low="${text,,}"
if [[ "$low" == *"inline execution"* \
&& ( "$low" == *"which approach"* || "$low" == *"two execution options"* ) ]] \
|| [[ "$low" == *"subagent-driven or inline"* || "$low" == *"inline or subagent"* ]] \
|| [[ "$low" == *"subagent-driven vs inline"* || "$low" == *"subagent vs inline"* \
|| "$low" == *"inline vs subagent"* ]] \
|| [[ "$low" == *"execution approach"* && "$low" == *"?"* ]]; then
cat <<'JSON'
{"decision":"block","reason":"Execution-mode menu detected in your final message. boma standing preference (docs/FRICTION.md + always-subagent-driven-execution memory): never present the subagent-driven-vs-inline menu. Drop the menu and proceed with subagent-driven execution directly (superpowers:subagent-driven-development)."}
JSON
exit 0
fi
# Brainstorming spec-review gate: asking the user to review the committed spec before
# the implementation plan. Tight signature: "implementation plan" present, plus either the
# skill's literal "spec written and committed" line, or the review+spec+before combination.
if [[ "$low" == *"implementation plan"* \
&& ( "$low" == *"spec written and committed"* \
|| ( "$low" == *"review"* && "$low" == *"the spec"* && "$low" == *"before"* ) ) ]]; then
cat <<'JSON'
{"decision":"block","reason":"Brainstorming spec-review gate detected in your final message. boma standing agreement (docs/FRICTION.md): once the spec is written and committed, move directly to the implementation plan (superpowers:writing-plans) — do not stop to ask the user to review the spec first. Drop the gate and proceed."}
JSON
exit 0
fi
exit 0

@ -1,12 +1,16 @@
#!/usr/bin/env bash
#
# PreToolUse guard (Bash): block `git commit` when the rbw vault agent is locked.
# The pre-commit ansible-lint hook decrypts vault.yml via rbw, so a commit while
# locked fails deep with a confusing error. This catches it early with a clear fix.
# PreToolUse guard (Bash): block `git commit` ONLY when the rbw vault agent is locked
# AND the commit would actually need the vault. The pre-commit ansible-lint hook decrypts
# vault.yml via rbw — but it is scoped (`files: ^(roles|playbooks|inventories)/.*\.ya?ml$`,
# always_run:false), so a docs-/config-only commit never triggers it and needs no vault.
# (2026-06-17 kaizen, docs/FRICTION.md: the old guard blocked *every* locked commit, so a
# docs-only commit got snagged needing a vault password it never uses.)
#
# Fails OPEN: only blocks on a definitive "rbw present AND not unlocked" signal.
# If rbw is missing, the command isn't a plain `git commit`, or `--no-verify` is
# used, the action is allowed.
# Fails OPEN: blocks only on a definitive "Ansible content staged AND rbw locked" signal.
# rbw missing, not a plain `git commit`, `--no-verify`, or no Ansible content staged → allow.
# When unsure it errs toward blocking (asking for an unlock is cheap; a deep pre-commit
# failure is not).
#
set -uo pipefail
@ -22,14 +26,25 @@ case "$cmd" in
esac
command -v rbw >/dev/null 2>&1 || exit 0 # rbw not installed — allow
rbw unlocked >/dev/null 2>&1 && exit 0 # unlocked — allow
if rbw unlocked >/dev/null 2>&1; then
exit 0 # unlocked — allow
fi
# rbw is LOCKED. Only block if this commit would run the vault-decrypting ansible-lint
# hook — i.e. staged content matches its `files:` scope. Mirror that regex exactly.
ANSIBLE_RE='^(roles|playbooks|inventories)/.*\.ya?ml$'
# rbw present but not unlocked (locked or agent not running) — the commit would
# fail in the pre-commit hook, so block early with guidance.
cd "${CLAUDE_PROJECT_DIR:-.}" 2>/dev/null || exit 0
files=$(git diff --cached --name-only 2>/dev/null) || exit 0
# `git commit -a/--all` also sweeps in modified tracked files that aren't staged yet.
# (Substring match — errs toward including them, which only ever over-blocks. Safe.)
case " $cmd " in
*" -a"*|*"--all"*) files="$files"$'\n'"$(git diff --name-only 2>/dev/null)" ;;
esac
# No Ansible content in the fileset → ansible-lint hook won't run → no vault needed → allow.
printf '%s\n' "$files" | grep -Eq "$ANSIBLE_RE" || exit 0
# Ansible content staged AND rbw locked — the commit would fail deep in pre-commit. Block.
cat <<'JSON'
{"hookSpecificOutput":{"hookEventName":"PreToolUse","permissionDecision":"deny","permissionDecisionReason":"rbw is locked — the pre-commit ansible-lint hook needs the vault password to decrypt vault.yml. Run: rbw unlock"}}
{"hookSpecificOutput":{"hookEventName":"PreToolUse","permissionDecision":"deny","permissionDecisionReason":"rbw is locked and this commit stages Ansible content — the pre-commit ansible-lint hook needs the vault password to decrypt vault.yml. Run: rbw unlock (docs-/config-only commits are exempt and won't hit this guard.)"}}
JSON
exit 0

@ -56,6 +56,23 @@
}
]
}
],
"Stop": [
{
"hooks": [
{
"type": "command",
"command": "bash \"${CLAUDE_PROJECT_DIR:-.}/.claude/hooks/guard-execution-mode-menu.sh\"",
"timeout": 10,
"statusMessage": "Checking for execution-mode menu"
}
]
}
]
},
"statusLine": {
"type": "command",
"command": "bash \"${CLAUDE_PROJECT_DIR:-.}/.claude/statusline.sh\"",
"padding": 0
}
}

.claude/statusline.sh Executable file

@ -0,0 +1,63 @@
#!/usr/bin/env bash
#
# Claude Code statusLine — shows working dir, model, and context-window usage.
# Wired via .claude/settings.json (statusLine.command). Receives the statusLine
# JSON on stdin; first stdout line is rendered (ANSI colour supported).
#
# Context usage comes straight from the input JSON — no transcript parsing:
# .context_window.used_percentage pre-calculated % of the window in use (input side)
# .context_window.context_window_size window size in tokens (1000000 for the 1M models)
# verified: Claude Code statusLine schema · code.claude.com/docs/en/statusline · 2026-06-17
#
# Fails soft: any parse problem prints nothing and exits 0 (never breaks the prompt).
set -uo pipefail
input=$(cat 2>/dev/null) || exit 0
command -v jq >/dev/null 2>&1 || exit 0
# pct<TAB>window<TAB>dir-basename<TAB>model-name (used_percentage preferred,
# else derived from current_usage, else 0). @tsv keeps spaces in the dir safe.
parsed=$(printf '%s' "$input" | jq -r '
(.workspace.current_dir // .cwd // "" | sub(".*/"; "")) as $dir
| (.model.display_name // "?") as $model
| (.context_window.context_window_size // 200000) as $win
| (
if (.context_window.used_percentage // null) != null then
.context_window.used_percentage
elif (.context_window.current_usage // null) != null then
((.context_window.current_usage.input_tokens
+ (.context_window.current_usage.cache_creation_input_tokens // 0)
+ (.context_window.current_usage.cache_read_input_tokens // 0)) / $win * 100)
else 0 end | floor
) as $pct
| [$pct, $win, $dir, $model] | @tsv
' 2>/dev/null) || exit 0
[ -z "$parsed" ] && exit 0
IFS=$'\t' read -r pct win dir model <<<"$parsed"
# Human window label: 1000000 -> 1M, 200000 -> 200k, else Nk.
case "$win" in
1000000) wlabel="1M" ;;
*) wlabel="$((win / 1000))k" ;;
esac
# Colour the bar/percentage by pressure: green <70, yellow 70-89, red >=90.
if [ "$pct" -ge 90 ]; then col=$'\033[31m' # red
elif [ "$pct" -ge 70 ]; then col=$'\033[33m' # yellow
else col=$'\033[32m' # green
fi
dim=$'\033[2m'; rst=$'\033[0m'
# 10-cell bar; clamp fill to [0,10] so an over-100 reading can't overflow.
filled=$((pct / 10)); [ "$filled" -gt 10 ] && filled=10; [ "$filled" -lt 0 ] && filled=0
bar=""
for ((i = 0; i < 10; i++)); do
if [ "$i" -lt "$filled" ]; then bar+="█"; else bar+="░"; fi
done
printf '%s%s%s · %s · %s%s %d%%%s %sctx/%s%s\n' \
"$dim" "$dir" "$rst" \
"$model" \
"$col" "$bar" "$pct" "$rst" \
"$dim" "$wlabel" "$rst"

@ -0,0 +1,22 @@
# syntax=docker/dockerfile:1
# Custom Caddy image: vanilla Caddy + the Gandi DNS-01 plugin (ADR-024).
#
# WHY: mesh/LAN-only services have no public A-record, so they cannot satisfy ACME
# HTTP-01; they need DNS-01 against Gandi (the M1 *.<domain> wildcard strategy).
# Caddy's official image ships no third-party DNS plugins, so we compile one in.
#
# WHERE to build: on ubongo (the control node) — NOT on askari/Hetzner. Google's Go
# module proxy 403s Hetzner IP ranges, which broke the original on-host build (M4a).
# Build here, push the pinned tag/digest to the Forgejo registry, pull on askari.
#
# Versions pinned (ADR-011/ADR-014). caddy-dns/gandi v1.1.0 -> libdns/gandi v1.1.0,
# which authenticates with a Gandi Personal Access Token via "Authorization: Bearer"
# against https://api.gandi.net/v5/livedns (the legacy Apikey scheme is gone — using
# a PAT in the old Apikey slot 403s, which is what sank the M4a attempt).
# verified: caddy-dns/gandi v1.1.0 sends the PAT as Bearer · WebFetch libdns/gandi
# client.go @master (go.mod requires v1.1.0) · 2026-06-15
FROM caddy:2.11.4-builder AS build
RUN xcaddy build v2.11.4 --with github.com/caddy-dns/gandi@v1.1.0
FROM caddy:2.11.4
COPY --from=build /usr/bin/caddy /usr/bin/caddy

.gitignore

@ -31,3 +31,9 @@ terraform/**/*.tfstate
terraform/**/*.tfstate.backup
terraform/**/terraform.tfvars
# .terraform.lock.hcl is intentionally tracked (pins provider versions)
# Service-UI verification screenshots (kept locally on ubongo, not committed — ADR-017)
.verify-runs/
# Integration-test transient run dir (ADR-025); diagnostics live under ~/integration-runs
tests/integration/.run/

@ -19,6 +19,15 @@ repos:
rev: v24.12.2 # keep in sync with requirements.txt
hooks:
- id: ansible-lint
# Only run on Ansible content. ansible-lint loads the play context, which
# auto-decrypts inventories/*/group_vars/all/vault.yml via the wired
# vault_password_file (→ rbw) — so it needs `rbw unlock`. The upstream hook is
# always_run+pass_filenames:false (lints the whole project, every commit); we
# override always_run:false and add a files filter so docs-/config-only commits
# skip it (no vault needed). pass_filenames stays false → still a project lint
# when any Ansible file is staged.
always_run: false
files: ^(roles|playbooks|inventories)/.*\.ya?ml$
additional_dependencies:
- ansible-core==2.17.* # pin (not >=) — keep in sync with requirements.txt

@ -24,4 +24,5 @@ ignore: |
.venv/
.collections/
.scaffold/
tests/integration/.run/
**/vault.yml

@ -14,7 +14,8 @@ Keep it dense and command-focused. Verbose detail lives in `docs/`.
Homelab infrastructure automation for a Proxmox cluster running 25 Debian 13 VMs.
All hosts share a hardened base configuration. Each host runs a defined set of Docker
services deployed via Compose files rendered from Ansible templates. Ansible runs from
a dedicated control VM. CI runs on Forgejo Actions (self-hosted).
a dedicated physical control node (`ubongo`) outside the cluster. CI runs on Forgejo
Actions (self-hosted).
Full design rationale: `docs/decisions/`
@ -32,6 +33,8 @@ Full design rationale: `docs/decisions/`
| Scaffold a new role | `make new-role NAME=<name>` |
| Review repo for drift/cruft | `/review-repo` (Claude command) |
| Review hardware capacity | `/capacity-review` (Claude command) |
| Edit the vault (nvim, auto re-encrypt) | `make edit-vault [VAULT=<path>]` |
| Validate vault structure | `make check-vault [VAULT=<path>]` |
| Encrypt a vault file | `make encrypt FILE=<path>` |
| Decrypt a vault file | `make decrypt FILE=<path>` |
| Install Python deps | `make setup` |
@ -40,6 +43,8 @@ Full design rationale: `docs/decisions/`
| Terraform plan | `make tf-plan [TF_ENV=staging]` |
| Terraform apply | `make tf-apply [TF_ENV=staging]` |
| Regenerate Ansible inventory | `make tf-inventory TF_ENV=<staging\|production>` |
| Integration-test a host on a local VM | `make test-integration HOST=<name> [CERTS=…]` |
| Clean up integration test VMs | `make test-integration-clean` |
**Always `tf-plan` before `tf-apply`. Always `check` before `deploy`. Never skip lint.**
@ -50,11 +55,18 @@ Full design rationale: `docs/decisions/`
## Ansible conventions
- **FQCN always**: `ansible.builtin.template`, never `template`
- **Tags**: every task must have at least one tag; playbooks support `--tags` filtering
- **Tags** (ADR-019): import each role with its role-name tag once at the play level
(Ansible inherits it to every task). Tag a task/block with a concern tag from the
approved list (`tests/tags.yml`) only where it genuinely belongs to that concern —
don't invent tags or tag for tagging's sake. Target one axis at a time (role/service
*or* concern; tags are union/OR, never intersected). `make lint` enforces the vocabulary and that each role import carries its role-name tag.
- **Handlers**: use `listen:` topic strings, not direct name references
- **Variables**: `rolename__varname` double-underscore namespace for role defaults
- **No inline vars in playbooks**: use `group_vars/` or `host_vars/` only
- **Loops**: prefer `loop:` over `with_items:`
- **Loop var keys**: index with `item['key']`, never `item.key` — a key named
`values`/`keys`/`items`/`get`/… resolves to the dict *method* (silently corrupt +
non-idempotent), not the value
- **Conditionals**: prefer `true`/`false` over `yes`/`no`
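The loop-var-key rule can be illustrated outside Ansible. The sketch below is plain Python, not Jinja2 itself: `dot_lookup` is a hypothetical helper that mimics the attribute-first resolution Jinja2 documents for dotted access, and the sample `item` is made up.

```python
def dot_lookup(obj, name):
    """Mimic template-style `obj.name`: try the attribute, then the key."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


item = {"name": "web", "values": [80, 443]}

# No dict attribute called "name", so the lookup falls through to the key.
print(dot_lookup(item, "name"))

# "values" is a dict method, so dotted access finds the method first.
print(callable(dot_lookup(item, "values")))

# Subscript access always reaches the stored value.
print(item["values"])
```

Because the dotted form hands back a bound method rather than the list, anything rendered or compared from it differs from the intended value, which is why the convention asks for `item['key']` everywhere rather than only for known-risky key names.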
---
@ -71,7 +83,21 @@ Full design rationale: `docs/decisions/`
git commit** — the pre-commit ansible-lint hook decrypts `vault.yml`), run `rbw
unlocked`; if it exits non-zero, ask the user to `rbw unlock` and wait rather than
starting and failing partway. The agent stays unlocked 5h.
- To edit a vault file: `make decrypt FILE=<path>`, edit, `make encrypt FILE=<path>`
- To edit the vault: `make edit-vault` — decrypts → opens nvim → re-encrypts on `:wq`
(abort with `:cq`), then `check-vault` validates structure. No plaintext lands in the
work tree. Override the file with `VAULT=<path>`. (The lower-level `make decrypt`/
`encrypt FILE=<path>` still exist for scripted/non-interactive edits.)
- `make check-vault` validates the vault decrypts, is valid YAML, keeps secrets under the
nested `vault:` map, and has no empty leaves — printing a structure view with values
masked. Needs `rbw` unlocked. It also **flags any leaf still set to `CHANGEME`** (see
next bullet).
- **Stubbing a secret the operator must supply** (don't ping-pong over chat): when a new
secret is needed, the agent itself adds the vault entry with the sentinel value
**`CHANGEME`** plus a comment stating *what it is and how to obtain it*, wires the code
to `{{ vault.<service>.<key> }}`, and commits that. Then prompt the operator to run
`make edit-vault`, replace the `CHANGEME`(s) with the real value(s) — which never touch
the conversation — and re-encrypt. `make check-vault` lists any outstanding `CHANGEME`
placeholders so nothing is forgotten. The agent never handles the real secret.
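The structural checks described for `make check-vault` could be sketched over an already-decrypted, already-parsed mapping. This is an illustration only: the real target's implementation is not shown here, `walk_leaves` and `vault_problems` are hypothetical names, and treating "nested under `vault:`" as "exactly one top-level key" is an assumption.

```python
def walk_leaves(node, path=()):
    """Yield (dotted_path, value) for every leaf of a nested mapping."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk_leaves(value, path + (str(key),))
    else:
        yield ".".join(path), node


def vault_problems(doc, sentinel="CHANGEME"):
    """Return human-readable problems; an empty list means the vault is clean."""
    if set(doc) != {"vault"} or not isinstance(doc.get("vault"), dict):
        # Assumption: every secret must sit under one top-level 'vault' map.
        return ["secrets must live under a single top-level 'vault:' map"]
    problems = []
    for dotted, value in walk_leaves(doc):
        if value is None or value == "":
            problems.append(f"{dotted}: empty leaf")
        elif value == sentinel:
            problems.append(f"{dotted}: still {sentinel}")
    return problems


# Placeholder document; no real secret values appear here.
doc = {"vault": {"hetzner": {"token": "CHANGEME"},
                 "forgejo": {"registry_token": ""}}}
for problem in vault_problems(doc):
    print(problem)
```

Reporting only the dotted path and the kind of problem keeps values out of the output, in the same spirit as the masked structure view.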
---
@ -81,6 +107,12 @@ Full design rationale: `docs/decisions/`
- Every role must have a populated `README.md`
- Every role must have `meta/main.yml` filled in
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md`
- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
`docs/access/service-access-template.md`; rendered from the role's `access__*` data
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
data. A stateless service records `backup__state: false` with a reason.
- One service = one self-contained role; no shared multi-service roles (ADR-004)
- Role names: `snake_case`, descriptive nouns (`base`, `docker_host`, `reverse_proxy`)
- Use `make new-role NAME=<name>` to scaffold — never create role structure by hand
@ -99,13 +131,17 @@ inventories/
vault.yml
docker_hosts/ # hosts running Docker services
proxmox_hosts/ # Proxmox nodes themselves
offsite_hosts/ # off-site hosts (askari) — NetBird coordinator + watchdog
host_vars/ # per-host overrides
staging/ # safe to run freely
```
Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`
Host groups: `all`, `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`
(`control` holds the one manually-provisioned control node — see ADR-009.)
(`control` holds `ubongo`, the one manually-provisioned **physical** control node
outside the cluster; `offsite_hosts` holds `askari`, the off-site Hetzner host that
runs the NetBird coordinator + watchdog — also added manually. See ADR-009, ADR-015,
ADR-016.)
---
@ -138,12 +174,18 @@ Single-contributor, trunk-based (no merge requests / approval gates):
## Terraform conventions
- Terraform owns VM existence only — nothing inside a VM, and no DNS records
- Every TF-managed VM carries three Proxmox tags — `<env>`, its inventory `group`, and
`managed-by=terraform` — as **metadata only** (ADR-019). They do not feed inventory
or run-targeting; `tf_to_inventory.py` still groups by the `group` output field.
- Internal DNS is entirely Ansible (the `dns` role renders the zone from inventory)
- OPNsense is entirely Ansible; do not reach for a Terraform OPNsense provider
- Environments are separate directories (`staging/`, `production/`), not workspaces
- Secrets via `TF_VAR_*` env vars only — never in `.tfvars` files
- `terraform.tfvars.example` is tracked; `terraform.tfvars` is gitignored
- `.terraform.lock.hcl` is tracked (pins provider versions)
- Every module declares its own `required_providers` (in `versions.tf`) for any
non-hashicorp provider — otherwise TF infers `hashicorp/<name>` and `init` fails
(caught only by a live `tf-init`, not by static review)
- Full rationale: `docs/decisions/006-terraform.md`
---
@ -156,7 +198,7 @@ Single-contributor, trunk-based (no merge requests / approval gates):
- Edit vault-encrypted files directly — decrypt first, re-encrypt after
- Force-push or rewrite already-pushed history on `main`
- Add a collection to `requirements.yml` without a specific module need in existing role tasks
- Open a firewall port anywhere but the `group_vars` firewall definitions — never ad-hoc on a host (ADR-002)
- Open a firewall port anywhere but the `group_vars` service catalog — never ad-hoc on a host. If it's not in the catalog, it doesn't exist (ADR-002, ADR-020)
- Disable or weaken a baseline control from ADR-002 (SSH hardening, nftables default-deny, fail2ban, auditd)
- Expose a service to the LAN/WAN without it sitting behind the reverse proxy with authentication (ADR-002)
- Deploy a service that hasn't cleared `docs/security/service-checklist.md` (record any deviation in `docs/security/accepted-risks.md`)
@ -187,24 +229,39 @@ Single-contributor, trunk-based (no merge requests / approval gates):
| Topic | File |
|------------------------|---------------------------------------|
| Architecture overview | `docs/decisions/001-architecture.md` |
| Capabilities overview (what boma does) | `docs/capabilities.md` |
| Build order / roadmap | `docs/ROADMAP.md` |
| Capabilities overview (what boma does) | `docs/CAPABILITIES.md` |
| Security baseline & strategy | `docs/decisions/002-security.md` |
| Accepted security risks | `docs/security/accepted-risks.md` |
| Per-service security checklist | `docs/security/service-checklist.md` |
| Per-service security record (template) | `docs/security/service-security-template.md` |
| Per-service verification spec (template) | `docs/testing/service-verify-template.md` |
| Heritage / V4 policy | `docs/decisions/013-heritage-v4.md` |
| Sourcing tech knowledge | `docs/decisions/014-knowledge-sourcing.md` |
| Toolchain choices | `docs/decisions/003-toolchain.md` |
| Docker & Compose model | `docs/decisions/004-docker-model.md` |
| Bootstrapping hosts | `docs/decisions/005-bootstrapping.md` |
| Control / AI-worker host (`ubongo`) | `docs/decisions/015-control-host.md` |
| Terraform | `docs/decisions/006-terraform.md` |
| Network topology | `docs/decisions/007-network.md` |
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
| Testing methodology | `docs/decisions/008-testing.md` |
| Service-UI verification (Level 4) | `docs/decisions/017-service-ui-verification.md` |
| TF ↔ Ansible handoff | `docs/decisions/009-provisioning-handoff.md` |
| Forgejo & CI | `docs/decisions/010-forgejo-ci.md` |
| Update management | `docs/decisions/011-update-management.md` |
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
| Logging & log integrity | `docs/decisions/018-logging.md` |
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
| Firewall strategy | `docs/decisions/020-firewall.md` |
| Operational access | `docs/decisions/021-operational-access.md` |
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
| ADR structure & lifecycle | `docs/decisions/023-adr-structure.md` |
| Reverse proxy (Caddy) | `docs/decisions/024-reverse-proxy.md` |
| Local VM integration testing (ADR-025) | `docs/decisions/025-local-vm-integration-testing.md` |
| Integration testing runbook | `docs/runbooks/integration-testing.md` |
| Adding a new role | `docs/runbooks/new-role.md` |
| Adding a new host | `docs/runbooks/new-host.md` |
| Enrolling a NetBird client (laptop/phone) | `docs/runbooks/netbird-client.md` |
| Rotating vault secrets | `docs/runbooks/rotate-secrets.md` |
| Claude Code setup (per machine) | `docs/runbooks/claude-code-setup.md` |

Makefile

@ -5,24 +5,45 @@ VENV := .venv
PYTHON := $(VENV)/bin/python
PIP := $(VENV)/bin/pip
ANSIBLE := $(VENV)/bin/ansible
PLAYBOOK := $(VENV)/bin/ansible-playbook
PLAYBOOK_BIN := $(VENV)/bin/ansible-playbook
GALAXY := $(VENV)/bin/ansible-galaxy
LINT := $(VENV)/bin/ansible-lint
MOLECULE := $(VENV)/bin/molecule
# Vault password is resolved via ansible.cfg (vault_password_file); no flag needed.
VAULT_ARGS :=
INVENTORY := -i inventories/production/hosts.yml
# Default vault file for edit-vault / check-vault (override with VAULT=<path>).
VAULT ?= inventories/production/group_vars/all/vault.yml
INVENTORY := -i inventories/production/
TF := terraform
TF_ENV ?= staging
MOLECULE_IMAGE := forgejo.nyumbani.baobab.band/sjat/molecule-debian13:latest
MOLECULE_DOCKERFILE := .docker/molecule-debian13/Dockerfile
# Custom Caddy + Gandi DNS-01 plugin (ADR-024). Build on ubongo, NOT askari/Hetzner
# (the Go module proxy 403s Hetzner IPs); push the pinned tag to the Forgejo registry.
CADDY_IMAGE := forgejo.nyumbani.baobab.band/sjat/caddy-gandi:2.11.4
CADDY_DOCKERFILE := .docker/caddy-gandi/Dockerfile
# Forgejo container registry (same host/user as the image tags above). `make registry-login`
# logs the Docker daemon in using vault.forgejo.registry_token (2026-06-17 kaizen) so image
# pushes are agent-completable non-interactively.
REGISTRY_HOST := forgejo.nyumbani.baobab.band
REGISTRY_USER := sjat
# For TF_ENV=offsite, source the Hetzner token from the vault into the environment
# (rbw must be unlocked). Read in-memory; never written to a tfvars file (CLAUDE.md).
ifeq ($(TF_ENV),offsite)
TF_TOKEN_ENV := TF_VAR_hcloud_token="$$($(ANSIBLE)-vault view inventories/production/group_vars/all/vault.yml | $(PYTHON) -c 'import sys, yaml; print(yaml.safe_load(sys.stdin)["vault"]["hetzner"]["token"])')"
else
TF_TOKEN_ENV :=
endif
.DEFAULT_GOAL := help
.PHONY: help setup collections lint test test-all check deploy encrypt decrypt new-role \
tf-init tf-plan tf-apply tf-output tf-inventory \
molecule-image molecule-image-push
.PHONY: help setup collections lint test test-all test-integration test-integration-clean \
check deploy encrypt decrypt \
edit-vault check-vault new-role \
tf-init tf-plan tf-apply tf-output tf-inventory tf-inventory-offsite \
molecule-image molecule-image-push caddy-image caddy-image-push registry-login
help:
@echo ""
@@ -33,8 +54,12 @@ help:
@echo " make lint Run yamllint + ansible-lint"
@echo " make test ROLE=<name> Run Molecule tests for a role"
@echo " make test-all Run Molecule tests for all roles"
@echo " make check PLAYBOOK=<name> Dry-run a playbook (check mode)"
@echo " make deploy PLAYBOOK=<name> Run a playbook against production"
@echo " make test-integration HOST=<name> [CERTS=internal|le-staging] [KEEP=1] Run ADR-025 integration cycle against a VM"
@echo " make test-integration-clean Prune stale integration-test VM snapshots"
@echo " make check PLAYBOOK=<name> [LIMIT=<host>] [TAGS=<tags>] [EXTRA=<args>] Dry-run a playbook (check mode)"
@echo " make deploy PLAYBOOK=<name> [LIMIT=<host>] [TAGS=<tags>] [EXTRA=<args>] Run a playbook against production"
@echo " make edit-vault [VAULT=<path>] Edit the vault in nvim (auto re-encrypts + checks)"
@echo " make check-vault [VAULT=<path>] Validate vault structure (values masked)"
@echo " make encrypt FILE=<path> Encrypt a vault file"
@echo " make decrypt FILE=<path> Decrypt a vault file"
@echo " make new-role NAME=<name> Scaffold a new role"
@@ -44,11 +69,15 @@ help:
@echo " make tf-apply [TF_ENV=staging] Apply Terraform changes"
@echo " make tf-output [TF_ENV=staging] Print Terraform outputs as JSON"
@echo " make tf-inventory [TF_ENV=staging] Regenerate Ansible inventory from Terraform outputs"
@echo " make tf-inventory-offsite Generate offsite_hosts inventory (askari) into inventories/production/"
@echo ""
@echo " TF_ENV defaults to 'staging'. Use TF_ENV=production for production."
@echo ""
@echo " make molecule-image Build the Molecule test image locally"
@echo " make molecule-image-push Push the test image to the Forgejo registry"
@echo " make caddy-image Build the custom Caddy + Gandi DNS-01 image (run on ubongo)"
@echo " make caddy-image-push Push the Caddy image to the Forgejo registry"
@echo " make registry-login Log Docker into the Forgejo registry (vaulted token)"
@echo ""
# ── Environment setup ─────────────────────────────────────────────────────────
@@ -67,6 +96,7 @@ collections:
lint:
$(VENV)/bin/yamllint .
$(LINT)
$(PYTHON) scripts/check-tags.py
# ── Testing ───────────────────────────────────────────────────────────────────
@@ -74,30 +104,50 @@ test:
ifndef ROLE
$(error ROLE is required: make test ROLE=<rolename>)
endif
cd roles/$(ROLE) && ../../$(MOLECULE) test
cd roles/$(ROLE) && PATH="$(CURDIR)/$(VENV)/bin:$$PATH" molecule test
test-all:
@for role in roles/*/; do \
echo "── Testing $$role ──"; \
cd $$role && ../../$(MOLECULE) test; cd ../..; \
cd $$role && PATH="$(CURDIR)/$(VENV)/bin:$$PATH" molecule test; cd ../..; \
done
test-integration:
ifndef HOST
$(error HOST is required: make test-integration HOST=<name> [CERTS=internal|le-staging] [KEEP=1])
endif
PATH="$(CURDIR)/$(VENV)/bin:$$PATH" $(PYTHON) scripts/integration-vm.py cycle \
--host $(HOST) $(if $(CERTS),--certs $(CERTS)) $(if $(KEEP),--keep)
test-integration-clean:
PATH="$(CURDIR)/$(VENV)/bin:$$PATH" $(PYTHON) scripts/integration-vm.py prune
# ── Playbook execution ────────────────────────────────────────────────────────
check:
ifndef PLAYBOOK
$(error PLAYBOOK is required: make check PLAYBOOK=<name>)
endif
$(PLAYBOOK) $(INVENTORY) $(VAULT_ARGS) --check --diff playbooks/$(PLAYBOOK).yml
$(PLAYBOOK_BIN) $(INVENTORY) $(VAULT_ARGS) $(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS)) $(EXTRA) --check --diff playbooks/$(PLAYBOOK).yml
deploy:
ifndef PLAYBOOK
$(error PLAYBOOK is required: make deploy PLAYBOOK=<name>)
endif
$(PLAYBOOK) $(INVENTORY) $(VAULT_ARGS) playbooks/$(PLAYBOOK).yml
$(PLAYBOOK_BIN) $(INVENTORY) $(VAULT_ARGS) $(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS)) $(EXTRA) playbooks/$(PLAYBOOK).yml
# ── Vault ─────────────────────────────────────────────────────────────────────
# Streamlined edit: ansible-vault edit decrypts to a temp file, opens nvim, and
# re-encrypts on :wq (abort with :cq) — no plaintext ever lands in the work tree.
# Then validate structure. Override the file with VAULT=<path>.
edit-vault:
EDITOR=nvim $(ANSIBLE)-vault edit $(VAULT)
@$(PYTHON) scripts/check-vault.py $(VAULT)
check-vault:
@$(PYTHON) scripts/check-vault.py $(VAULT)
encrypt:
ifndef FILE
$(error FILE is required: make encrypt FILE=<path>)
@@ -118,19 +168,36 @@ molecule-image:
molecule-image-push: molecule-image
docker push $(MOLECULE_IMAGE)
# ── Custom Caddy image (Gandi DNS-01 plugin, ADR-024) ─────────────────────────
# DNS-01 (wildcard / mesh-LAN-only certs) needs the caddy-dns/gandi plugin compiled
# in via xcaddy. Build on ubongo — Google's Go module proxy 403s Hetzner IPs.
caddy-image:
docker build -t $(CADDY_IMAGE) -f $(CADDY_DOCKERFILE) .docker/caddy-gandi
caddy-image-push: caddy-image
docker push $(CADDY_IMAGE)
# Log the local Docker daemon into the Forgejo registry using the vaulted token, so the
# *-image-push targets above are agent-completable non-interactively (rbw must be unlocked).
registry-login:
@ANSIBLE_VAULT="$(ANSIBLE)-vault" PYTHON="$(PYTHON)" VAULT="$(VAULT)" \
REGISTRY_HOST="$(REGISTRY_HOST)" REGISTRY_USER="$(REGISTRY_USER)" \
bash scripts/registry-login.sh
# ── Terraform ─────────────────────────────────────────────────────────────────
tf-init:
$(TF) -chdir=terraform/environments/$(TF_ENV) init
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) init
tf-plan:
$(TF) -chdir=terraform/environments/$(TF_ENV) plan
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) plan
tf-apply:
$(TF) -chdir=terraform/environments/$(TF_ENV) apply
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) apply
tf-output:
$(TF) -chdir=terraform/environments/$(TF_ENV) output -json
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) output -json
tf-inventory:
ifndef TF_ENV
@@ -140,6 +207,11 @@ endif
| $(PYTHON) scripts/tf_to_inventory.py > inventories/$(TF_ENV)/hosts.yml
@echo "Inventory written to inventories/$(TF_ENV)/hosts.yml"
tf-inventory-offsite:
$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/offsite output -json \
| $(PYTHON) scripts/tf_to_inventory.py > inventories/production/offsite.yml
@echo "Offsite inventory written to inventories/production/offsite.yml"
# ── Role scaffolding ──────────────────────────────────────────────────────────
new-role:
@@ -151,7 +223,14 @@ endif
roles/$(NAME)/molecule/default
echo "---" > roles/$(NAME)/tasks/main.yml
echo "---" > roles/$(NAME)/handlers/main.yml
echo "---" > roles/$(NAME)/defaults/main.yml
printf '%s\n' '---' \
'# Role defaults use the <rolename>__var double-underscore namespace.' \
'#' \
'# Service roles (ADR-004) also declare access__*/backup__* data here. Those are' \
'# cross-role conventions (not rolename-prefixed), so EACH such line needs a trailing' \
'# noqa: var-naming[no-role-prefix] (ansible-lint 24.x has no per-prefix allowlist).' \
'# Reference: roles/reverse_proxy/defaults/main.yml' \
> roles/$(NAME)/defaults/main.yml
echo "---" > roles/$(NAME)/meta/main.yml
printf '# %s\n\nRole description here.\n' "$(NAME)" > roles/$(NAME)/README.md
cp .scaffold/molecule.yml roles/$(NAME)/molecule/default/molecule.yml
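
The `check` and `deploy` recipes above pass `--limit` and `--tags` only when `LIMIT` or `TAGS` is set, using `$(if $(VAR),--flag $(VAR))`, and append `$(EXTRA)` verbatim. The following is a minimal Python sketch of that argument assembly, not code from the repo: `build_playbook_argv` and its parameter names are hypothetical, and the binary and inventory defaults are simply copied from the `PLAYBOOK_BIN` and `INVENTORY` values in this diff.

```python
import shlex


def build_playbook_argv(playbook, limit=None, tags=None, extra="", check=False,
                        binary=".venv/bin/ansible-playbook",
                        inventory="inventories/production/"):
    """Mirror the Makefile's optional-flag expansion (hypothetical helper)."""
    argv = [binary, "-i", inventory]
    # $(if $(LIMIT),--limit $(LIMIT)): emit the flag only when the value is non-empty.
    if limit:
        argv += ["--limit", limit]
    if tags:
        argv += ["--tags", tags]
    # $(EXTRA) is passed through as-is; split it the way a POSIX shell would.
    argv += shlex.split(extra)
    # `make check` adds --check --diff; `make deploy` does not.
    if check:
        argv += ["--check", "--diff"]
    argv.append(f"playbooks/{playbook}.yml")
    return argv


if __name__ == "__main__":
    print(" ".join(build_playbook_argv("site", limit="askari", tags="hardening")))
```

With neither `limit` nor `tags` set, the result collapses to the plain invocation, which is the behaviour the `$(if ...)` form gives in Make: an unset variable contributes no words to the command line.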


@@ -57,7 +57,13 @@ See `Makefile` for the full list of targets.
├── docs/
│ ├── decisions/ # Architecture decision records (ADRs)
│ └── runbooks/ # Step-by-step operational procedures
│ ├── runbooks/ # Step-by-step operational procedures
│ ├── security/ # Per-service security checklist + templates + accepted risks
│ ├── testing/ # VERIFY.md template + service-UI verification reports
│ ├── access/ # ACCESS.md template (ADR-021)
│ ├── backup/ # BACKUP.md template (ADR-022)
│ ├── hardware/ # Physical capacity reference + reviews
│ └── reviews/ # /review-repo reports
├── inventories/
│ ├── production/ # Live hosts — edit carefully
@@ -65,10 +71,12 @@ See `Makefile` for the full list of targets.
├── playbooks/ # Orchestration playbooks
│ ├── site.yml # Full standard state
│ ├── workstation.yml # Developer environment (control group)
│ └── bootstrap.yml # First-run new host setup
├── roles/ # Ansible roles
│ ├── base/ # OS baseline applied to all hosts
│ ├── dev_env/ # Interactive developer environment
│ └── docker_host/ # Docker runtime setup
├── terraform/ # VM provisioning only — no DNS (see ADR-006/009)
@@ -92,6 +100,24 @@ See `Makefile` for the full list of targets.
- Network topology: `docs/decisions/007-network.md`
- Testing methodology: `docs/decisions/008-testing.md`
- Terraform ↔ Ansible handoff: `docs/decisions/009-provisioning-handoff.md`
- Forgejo & CI: `docs/decisions/010-forgejo-ci.md`
- Update management: `docs/decisions/011-update-management.md`
- Hardware & capacity: `docs/decisions/012-hardware-capacity.md`
- Heritage / V4 policy: `docs/decisions/013-heritage-v4.md`
- Sourcing technical knowledge: `docs/decisions/014-knowledge-sourcing.md`
- Control / AI-worker host (`ubongo`): `docs/decisions/015-control-host.md`
- Mesh VPN (NetBird): `docs/decisions/016-mesh-vpn.md`
- Service-UI verification (Level 4): `docs/decisions/017-service-ui-verification.md`
- Logging & log integrity: `docs/decisions/018-logging.md`
- Tagging & run-targeting: `docs/decisions/019-tagging.md`
- Firewall strategy: `docs/decisions/020-firewall.md`
- Operational access: `docs/decisions/021-operational-access.md`
- Backup & disaster recovery: `docs/decisions/022-backup.md`
- ADR structure & lifecycle: `docs/decisions/023-adr-structure.md`
- Reverse proxy (Caddy): `docs/decisions/024-reverse-proxy.md`
(CLAUDE.md carries the full cross-referenced table, including the runbooks and
security/testing docs.)
## Contributing


@@ -5,7 +5,7 @@ This repo is partly aspirational: the ADRs in `docs/decisions/` describe the
truth. **Before relying on a role, provider, or pipeline existing, check here.**
If something is listed as "designed, not built", do not assume it works.
_Last reviewed: 2026-05-30._
_Last reviewed: 2026-06-19._
## Real and working today
@@ -20,30 +20,47 @@ _Last reviewed: 2026-05-30._
| Pre-commit hooks | Configured: lint, gitleaks, vault-encryption guard. Activate with `pre-commit install` after `make setup`. |
| Vault password client | `scripts/vault-pass-client.sh` fetches the master password from Vaultwarden via `rbw` (wired as `vault_password_file`). Requires `rbw` installed + `rbw unlock`. |
| `/review-repo` | Repo audit: `scripts/repo-scan.py` (Phase 0) + `.claude/commands/review-repo.md`, reports to `docs/reviews/`. On-demand only; cron + email deferred (`docs/TODO.md`). |
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below |
| `/kaizen` | Curate `docs/FRICTION.md` Open signals → decisions ledger (`scripts/friction-scan.py` Phase 0, unit-tested, + `.claude/commands/kaizen.md`). Interactive, on-demand; `--nudge` (recurrence/age/backlog) surfaces in `/review-repo`. Headless/cron deferred (TODO 11.3). |
| Terraform HCL (`terraform/`) | Written (proxmox VM module + envs) — but never run; see below. Offsite env also written — see "Designed but not built". |
| `docs/hardware/reference.md` + `scripts/capacity-scan.py` | Present — reference doc (skeleton until real hardware) + stdlib scan; emits capacity JSON |
| `/capacity-review` | Works — on-demand capacity evaluation → `docs/hardware/reviews/`. Intent-based (no live usage yet) |
| ADR-002 security strategy + `docs/security/{accepted-risks,service-checklist}.md` | Present — threat model, principles, governance frame; checklist + risk register are docs, enforced manually in review |
| Service-role standard + per-service `SECURITY.md` convention | Defined (ADR-004 + `docs/security/service-security-template.md`); not yet applied — no service roles exist |
| Tag standard + enforcement (ADR-019) | Works — `tests/tags.yml` (closed vocabulary) + `scripts/check-tags.py` (run by `make lint`, unit-tested): enforces the tag vocabulary and that each role import in a play's `roles:` block carries its role-name tag. Governs mostly-unbuilt roles, but the linter is live now. Proxmox VM tag convention (`<env>`, group, `managed-by=terraform`) is in the Terraform HCL but unprovisioned. |
| `roles/dev_env/` — interactive developer environment | **Built + applied.** zsh + oh-my-zsh + oh-my-posh, tmux + TPM plugins, neovim; dotfiles deployed via GNU stow (re-derived from V4/fisi per ADR-013). Node.js from a pinned upstream tarball (not Debian's npm). Lint + Molecule (idempotent) green. **Applied to `ubongo`** for users `sjat` + `claude` (verified: zsh login shells, stow-symlinked `.zshrc`/`.tmux.conf` + nvim config, oh-my-zsh, tmux plugins; nvim v0.12.2, oh-my-posh 29.0.1). Run via `playbooks/workstation.yml` against the `control` group (no dedicated `workstations` group yet). |
| `make check` / `make deploy PLAYBOOK=<name>` | **Works.** First end-to-end run (applying `dev_env`) surfaced + fixed latent bugs: Makefile `PLAYBOOK` var collision (binary path vs playbook-name arg) meant the targets never ran; `ansible.cfg` referenced uninstalled community.general callbacks (now built-in `default` + `ansible.posix.profile_tasks`); `acl` package added so Ansible can `become_user` an unprivileged user. The make targets now function — though `site`/`base`/`docker_host` content is still incomplete (see below). |
| `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (SPF `-all` + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (`tests/test_public_dns.py`). **Applied to wingu.me (2026-06-14):** purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects `0 .`) — the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. |
| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker + libvirt groups, **`NOPASSWD:ALL` sudo** — ADR-015 amended 2026-06-18; operator `sjat` uses password-required sudo via `sudo` group; the former `sjat-ansible` NOPASSWD drop-in removed 2026-06-18). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **NetBird mesh-enrolled (M5, 2026-06-17):** `wt0` up at `100.99.146.14` via the `base` `mesh` concern. **`base` firewall applied (mesh-hardening 2/3, 2026-06-19):** INPUT-only default-deny — input locked to `wt0` + ssh-from-control (`10.20.10.151`) + workstations (`10.20.10.50` mamba, `10.20.10.17`); forward `accept` (Docker/libvirt-NAT safe). Live-verified (SSH self-path + Docker egress, after a post-apply `restart docker` — base's flush wipes Docker nat, FRICTION); **real-host reboot-validated (2026-06-19):** after an operator reboot, the `policy drop` input chain + full allow-list re-applied on boot and the `wt0` mesh + SSH self-path came back clean. `claude` now self-SSHes (ad-hoc `authorized_keys` grant so the agent can run SSH-based deploys with the auto-rollback safety; fold into the control-node bootstrap). 
**Pending:** full `base` hardening (auditd/CIS); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservations (10.20.10.151 MAC `88:a4:c2:e0:ee:da` + the `.50`/`.17` workstation leases); Terraform state backup (now relevant — the offsite tfstate exists). |
| `askari` — off-site Hetzner VPS (ADR-007/016, M2) | **Built + applied.** Provisioned by Terraform (`environments/offsite`, `hetznercloud/hcloud`) as **cx23 / hel1 / Debian 13.5** (CAX11/ARM was out of stock EU-wide on 2026-06-14 → cx23 is same-spec x86, cheaper). cloud-init created the `ansible` user + passwordless sudo; a TF-managed Hetzner Cloud Firewall allows SSH only from ubongo's WAN (`91.226.145.80`). Reachable from ubongo (`ansible offsite_hosts -m ping` ✓), in the `offsite_hosts` inventory (generated `offsite.yml`), published at `askari.wingu.me` → `77.42.120.136`. **SSH-hardened + fail2ban (M3).** **Docker + Caddy reverse proxy (M4a):** `docker_host` + `reverse_proxy` (vanilla Caddy, HTTP-01) applied; `https://test.askari.wingu.me` serves a valid Let's Encrypt cert ✓ (firewall opens 80/443/3478). **NetBird coordinator (M4b):** `netbird_coordinator` deployed — dashboard live at `https://netbird.askari.wingu.me` (valid LE cert), management API behind embedded Dex (401 unauth), STUN on 3478/udp. **NetBird peer (M5, 2026-06-17):** also enrolled as a mesh agent (`base` `mesh` concern) — `wt0` at `100.99.226.39`, Management+Signal Connected; the agent coexists with the coordinator. **Mesh-hardening redesign applied + live reboot-validated (2026-06-20):** `base` INPUT-only nftables default-deny (`inet filter` input `policy drop`; forward `accept`, Docker-safe via a post-apply `restart docker`), SSH `wt0`-primary + a permanent WAN break-glass (ubongo's WAN `91.226.145.80`; the Hetzner console is the OOB ultimate fallback), managed over `wt0`; `netbird_coordinator` geolocation disabled (`NB_DISABLE_GEOLOCATION`) so a no-egress boot can't FATAL it. A real reboot recovered **unattended** — firewall persisted, Docker forwarding + public services (Caddy 80/443, STUN 3478) up, coordinator geo-disabled (no FATAL), `wt0`/mesh (Management+Signal Connected) + both SSH paths back.
**Pending:** offsite tfstate backup (ADR-022); relay-SPOF reduction (next mesh-hardening sub-project — `ubongo→askari` is currently `Relayed` through askari's own relay). |
| `roles/docker_host/` (Docker engine) + `roles/reverse_proxy/` (Caddy, ADR-024) | **Built + applied** (askari, M4a). `docker_host` installs Docker CE + compose; `reverse_proxy` is boma's standard Caddy proxy (HTTP-01 for public hosts; routes from `reverse_proxy__routes`). **DNS-01 for mesh/LAN-only services is now built + proven (2026-06-15):** custom `caddy-gandi` image (`.docker/caddy-gandi/`, `make caddy-image`, pinned caddy-dns/gandi v1.1.0 → Bearer PAT), enabled per-instance via `reverse_proxy__acme_dns_provider: gandi` + `reverse_proxy__image`. Verified end-to-end — a real wildcard cert issued via LE **staging** + Gandi DNS-01 with `vault.gandi.pat`. M4a's deferral (version skew + Hetzner-IP build) is closed; image **pending registry push** (`make caddy-image-push` needs `docker login`). The `reverse_proxy` Caddyfile is bind-mounted as a **directory** (`./caddy` → `/etc/caddy`) so atomic re-renders are visible in-container and `caddy reload` actually applies new routes (a single-file mount pinned the stale inode). |
| `roles/netbird_coordinator/` — NetBird control plane (ADR-016, M4b) | **Built + applied (askari, 2026-06-16). boma's FIRST real service role.** Self-hosted NetBird **v0.72.4**: a single combined `netbird-server` container (management + signal + relay + STUN + **embedded Dex IdP** at `/oauth2`) + `dashboard:v2.39.0`, on the shared `boma` network behind the M4a Caddy via gRPC-h2c + WebSocket + path routing (`reverse_proxy__routes` gained a raw-`caddy` route type). Secrets `vault.netbird.{auth_secret,datastore_key}` (self-generated). Carries the full service-role file set (SECURITY/VERIFY/ACCESS/BACKUP) — **first stateful role** (`backup__state: true`; encrypted SQLite at `/var/lib/netbird`, off-site backup pending `fisi`/ADR-022). **Verified live:** dashboard 200 + valid LE cert, `/api` 401 (auth-gated, routes OK), STUN up. **Not yet configured:** first-boot `/setup` admin + peer enrolment = M5. |
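
The "Tag standard + enforcement (ADR-019)" row above states two rules: every tag must come from a closed vocabulary, and each role import in a play's `roles:` block must carry its own role-name tag. The sketch below illustrates those two rules on an already-parsed play. It is not the contents of `scripts/check-tags.py`, which this diff does not show; `check_play_tags` is a hypothetical name, and the vocabulary is passed as a plain set because the layout of `tests/tags.yml` is not visible here.

```python
def check_play_tags(play, vocabulary):
    """Return a list of violations for one parsed play (a dict). Hypothetical helper."""
    errors = []
    for entry in play.get("roles", []):
        # A bare string entry ("- base") cannot carry any tags.
        if isinstance(entry, str):
            errors.append(f"role '{entry}' has no tags")
            continue
        role = entry.get("role")
        tags = entry.get("tags", [])
        # Ansible accepts a single tag as a scalar; normalise to a list.
        if isinstance(tags, str):
            tags = [tags]
        # Rule 1: the role import must carry its own role-name tag.
        if role not in tags:
            errors.append(f"role '{role}' is missing its role-name tag")
        # Rule 2: every tag must belong to the closed vocabulary.
        for tag in tags:
            if tag not in vocabulary:
                errors.append(f"role '{role}' uses unknown tag '{tag}'")
    return errors


if __name__ == "__main__":
    play = {"roles": [{"role": "base", "tags": ["base", "firewall"]}, "dev_env"]}
    for problem in check_play_tags(play, {"base", "firewall", "dev_env"}):
        print(problem)
```

Returning a list of messages rather than raising on the first failure lets a lint run report every violation in one pass.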
## Scaffolded but empty — NOT implemented
| Thing | State |
|---|---|
| `roles/base/` | Not in git — only an empty dir on disk (untracked). `site.yml` references it, so a clean clone errors on `make deploy PLAYBOOK=site` until it is built. |
| `roles/docker_host/` | Not in git. Same. |
| `roles/base/` | **Partially built.** Concerns built: `firewall` (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) and **`hardening`** (M3: sshd drop-in key-only + `PermitRootLogin no`, fail2ban sshd jail 5/1h; ADR-002) — both pytest/Molecule-tested. The **`hardening`** concern is **applied to askari** (`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`). The `firewall` concern is **applied to ubongo** (mesh-hardening 2/3, 2026-06-19) **and askari** (mesh-hardening redesign, 2026-06-20) — both INPUT-only default-deny via the `base__firewall_input_only` knob (input default-deny + `wt0`/ssh-from-control/`base__firewall_admin_addrs` allow-list; forward left `accept` so Docker/libvirt-NAT survive), both **live reboot-validated**. On a Docker host (askari) base's `flush ruleset` wipes Docker's nat, so the cutover follows the firewall apply with a `restart docker` to rebuild it (FRICTION). Not built: auditd, packages, users (Phase 2 / TODO 15). The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`) so a local-DNS hiccup can't strand the mesh — **applied + live on ubongo (2026-06-20)**: `getent hosts netbird.askari.wingu.me` → `77.42.120.136`, mesh unaffected. The single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment). |
| `inventories/*/hosts.yml` | Structured stubs with empty host maps (`hosts: {}`); regenerated by `make tf-inventory` once Terraform has hosts |
| `inventories/production/group_vars/{docker_hosts,proxmox_hosts}/` | Empty dirs |
So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `base` and
`docker_host` roles it calls do not exist yet.
(`roles/docker_host/` is no longer scaffold-only — it installs the Docker engine + Compose
and is built + applied to askari; see "Real and working today". Its deferred scope —
daemon hardening + `nftables.d` container rules, ADR-004/ADR-020 — is still pending.)
A `make deploy PLAYBOOK=site` run now applies real content — `base` (its `firewall` +
`hardening` concerns) plus a functional `docker_host` (Docker engine) on docker hosts —
but in practice it is still limited: the production cluster has no docker hosts yet, and
`base`'s `firewall` concern is now applied to `ubongo` (control) but not yet to cluster docker hosts (none exist), so a full cluster `site` run does not
yet exist. (The `make check`/`deploy` machinery itself works — first proven by applying
`dev_env` via `playbooks/workstation.yml`, then `base`/`docker_host`/`reverse_proxy` on
askari.)
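
The `roles/base/` row above describes pinning the coordinator FQDN in `/etc/hosts` so that the mesh survives a local-DNS failure. The role's actual task is not part of this diff, so the following is only an illustrative Python sketch of an idempotent pin with an anchored match. `pin_host` is a hypothetical helper that works on text rather than writing to the real file.

```python
import re


def pin_host(hosts_text, fqdn, ip):
    """Return hosts_text with exactly one '<ip> <fqdn>' line for fqdn (idempotent)."""
    pinned = f"{ip} {fqdn}"
    # Anchor on a whole whitespace-delimited field, so a longer name that merely
    # contains the FQDN (prefix or suffix) is not treated as the same entry.
    pattern = re.compile(rf"^\s*\S+\s+(?:.*\s)?{re.escape(fqdn)}(?:\s.*)?$")
    out, seen = [], False
    for line in hosts_text.splitlines():
        if not line.lstrip().startswith("#") and pattern.match(line):
            if not seen:
                out.append(pinned)
                seen = True
            continue  # drop stale or duplicate entries for this name
        out.append(line)
    if not seen:
        out.append(pinned)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    before = "127.0.0.1 localhost\n10.0.0.1 netbird.askari.wingu.me\n"
    print(pin_host(before, "netbird.askari.wingu.me", "77.42.120.136"), end="")
```

One simplification to be aware of: a matching line is replaced wholesale, so any other aliases on that line are dropped. Without the anchoring, a pattern that only searched for the name could also rewrite unrelated entries that happen to contain it.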
## Designed but not built
| Thing | Designed in | Notes |
|---|---|---|
| `dns` role (renders the internal zone) | ADR-007 / ADR-009 | Does not exist. Internal DNS ownership is assigned to it by design. |
| Terraform actually provisioning | ADR-006 / ADR-009 | Never `terraform init`ed: no `.terraform.lock.hcl`, no state, no real `local.vms` entries |
| Terraform actually provisioning (Proxmox) | ADR-006 / ADR-009 | Never `terraform init`ed: no `.terraform.lock.hcl`, no state, no real `local.vms` entries |
| CI (Forgejo Actions) | ADR-003 / ADR-008 | Pipeline described; not implemented |
| Level 2 / 3 testing (staging, `askari` smoke) | ADR-008 | Depends on real VMs / `askari`, which don't exist yet |
| Per-service roles | ADR-004 | Model defined; no service roles built |
@@ -52,6 +69,29 @@ So `make deploy PLAYBOOK=site` currently **fails** on a clean clone — the `bas
| `/security-review` skill | ADR-002 / TODO 8.5 | Periodic posture re-check + accepted-risk re-challenge; planned, not built |
| CIS hardening (Debian L1+L2 + Docker) | ADR-002 / TODO 15 | Implemented by the (unbuilt) `base`/`docker_host` roles; brings AppArmor + AIDE as baseline. L2 partitions affect VM provisioning (ADR-006) |
| Network IDS + security alerting | ADR-002 / TODO 15 | Suricata on OPNsense + AIDE/`auditd`/`fail2ban` alerting into the monitoring stack; not built |
| NetBird mesh — coordinator on `askari` | ADR-016 | **BUILT + applied (M4b, 2026-06-16)** — moved up to "Real and working today" (`roles/netbird_coordinator/`). Self-hosted control plane on askari; replaces ADR-007 WireGuard. Mesh **peer enrolment = M5** (next row). |
| NetBird agent enrollment in `base` | ADR-016 | **BUILT + applied (M5, 2026-06-17).** The `base` `mesh` concern (opt-in `base__mesh_enabled`) installs the pinned NetBird agent + runs `netbird up` with the reusable scoped key from `vault.netbird.setup_key`. Applied to **askari (`100.99.226.39`) + ubongo (`100.99.146.14`)** — both Management+Signal Connected; ubongo↔askari mesh ping verified. Enrollment is **additive** — the "SSH only on `wt0`" firewall lockdown is the deferred mesh-hardening follow-on, NOT applied. **Road-warrior clients (`mamba` + work laptop) enrolled (2026-06-17) → `ubongo` reachable from anywhere: the mobile-access goal is met and Phase 1 (remote access) is COMPLETE.** Client enrollment runbook: `docs/runbooks/netbird-client.md`. |
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
## Integration test harness (ADR-025)
| Thing | State |
|---|---|
| `roles/integration_test/` | **Built** — installs/enables libvirt+QEMU+virtinst on `control` group hosts; adds `sjat`/`claude` to `libvirt` group; creates image-cache dir. Lint clean; applied live to ubongo (substrate installed); molecule scenario present, not run in the build env. |
| `scripts/integration-vm.py` | **Built** — stdlib-only lifecycle driver over `virsh`/`virt-install`/`cloud-localds`: `up / apply / reboot / assert / cycle / down / prune / console`. Lazily ensures the golden Debian-13 genericcloud image. pytest clean (transient-inventory generation, var/overlay merge, `--certs` mapping, DHCP-lease parsing, resource-guard math). |
| `tests/integration/` (profile, verify, overrides) | **Built** — "be askari" profile + var overlay + `verify.yml` outcome assertions (Docker active, forward-chain accepts present, published-port DNAT alive). Validated end-to-end by the RED→GREEN acceptance run. |
| `make test-integration` / `make test-integration-clean` | **Built** — wired into `Makefile`. |
| ADR-025 | **Accepted (2026-06-18)** — decision recorded, approach A, cert tiers, safety invariants, UEFI boot requirement, and claude-sudo dependency documented. |
| **RED/GREEN acceptance (ubongo live pass)** | **PASSED (2026-06-18).** A throwaway KVM VM on ubongo reproduced the 2026-06-17 incident (base nftables forward default-deny kills Docker forwarding on reboot) = RED. Applying the `docker_host` container-forward drop-in and rebooting survived = GREEN. Nine shakedown findings captured in `docs/FRICTION.md`; key learnings (UEFI boot, claude sudo) recorded in ADR-025. `docs/TODO.md` item 2.4 closed. |
| `le-staging` cert validation | **Pending** — wired in v1 but not yet exercised on a real VM (separate from the RED/GREEN acceptance gate). |
## Keeping this honest


@@ -1,11 +1,12 @@
[defaults]
inventory = inventories/production/hosts.yml
inventory = inventories/production/
roles_path = roles
collections_path = .collections
vault_password_file = scripts/vault-pass-client.sh
interpreter_python = auto_silent
stdout_callback = yaml
callbacks_enabled = timer, profile_tasks
stdout_callback = default
callback_result_format = yaml
callbacks_enabled = ansible.posix.profile_tasks
# Avoid slow DNS lookups
[ssh_connection]


@@ -24,13 +24,19 @@ decisions this frame enables.
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|---|---|---|---|---|---|
| Reverse proxy / TLS | Traefik | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) |
| Reverse proxy / TLS | Caddy (ADR-024) | P | core | Edge routing + ACME certs for everything exposed | Spin-up order names it (TODO 12) |
| Internal DNS | `dns` role → dns1/dns2 | P | core | Authoritative internal zone (ADR-007) | Ansible-rendered zone |
| VPN / remote access | Netbird · *or* OPNsense WireGuard | P | candidate | Secure remote access to `srv`/`mgmt` | ⚠️ ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh |
| Public DNS | `public_dns` role → Gandi LiveDNS | P | core | wingu.me zone as code (ADR-007) | anti-spoof baseline; mesh/LAN-only default; applied (M1) |
| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
| Service portal / dashboard | Homepage | A | candidate | One landing page listing all services — a "what does what" front door | Gap surfaced by V4; fits boma's legibility goal |
_(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not containers.)_
_Firewalling is two-layer (ADR-020): OPNsense at the perimeter + inter-VLAN, plus
per-host `nftables` (default-deny inbound + east-west allowlist) rendered by the `base`
role from a shared `group_vars` service catalog. The host `nftables` layer is built (the
`base` firewall concern); the OPNsense layer is still to be built._
## 2. Identity & access — [P]
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
@ -43,8 +49,9 @@ _(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not cont
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|---|---|---|---|---|---|
| Metrics | Prometheus | P | planned | Time-series metrics + alert rules | TODO 3.6 |
| Logs | Loki | P | planned | Log aggregation | TODO 3.6 |
| Dashboards | Grafana | P | planned | Visualisation + alerting | TODO 3.6 |
| Logs | Loki (cluster all-logs + off-site security subset on `askari`) | P | core | Central log aggregation; a security subset ships write-only off-site (append-only) | **Decided (ADR-018)** |
| Log shipping agent | Grafana Alloy (in `base`) | P | core | Collects journald + container + security logs on every host; ships to Loki (ADR-018) | **Decided (ADR-018)** |
| Dashboards | Grafana | P | planned | Visualisation + alerting (incl. AIDE/`auditd`/`fail2ban`/Suricata + log-silence — ADR-018) | TODO 3.6 |
| Uptime checks | Uptime Kuma | P | planned | Endpoint up/down checks | TODO 3.6 |
| External watchdog | askari (Hetzner VPS) | P | core | Off-site monitoring that survives a homelab outage | ADR-007 |
| Notify / alerting | ntfy · Matrix · email (multi-channel) | S | planned | Deliver alerts to the user across channels | TODO 9; Matrix homeserver in §8 |
@ -98,9 +105,9 @@ _(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not cont
| Capability | Candidate service(s) | Tier | Commitment | What it does | Notes / open |
|---|---|---|---|---|---|
| Databases | Postgres/MariaDB — central *vs* per-app | P | candidate | Backing store for stateful apps | Open: central server vs per-service (TODO 3.9) |
| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |
| Backup engine | restic (data-only) | S | planned | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
| Off-site target | pCloud (via rclone) | S | planned | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
| Air-gap target | USB hard drives | S | planned | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
## 10. Operations & support — [S]
@ -109,6 +116,11 @@ _(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not cont
| Update watcher | DIUN | S | planned | New-image alerts driving the update process | ADR-011 |
| Scheduled jobs | `scheduled_jobs` role + `claude -p` jobs | S | planned | Declarative cron: `/review-repo`, security/capacity reviews, sanity checks | TODO 8 |
| Sanity / smoke | whoami + health checks | S | planned | Verification endpoints + "is it actually working" checks | ADR-011 / TODO 8.2 |
| Service-UI verification | `/verify-service` skill | S | planned | Claude-driven exploratory Level 4 acceptance check of a deployed service's UI | Decided (ADR-017); running deferred on ubongo + playwright + Authentik |
- **Targeted runs** (ADR-019): playbooks are sliced with `--tags` along two axes —
role/service (tag = role name) or a closed list of cross-cutting concerns
(`firewall`, `logging`, `config`, `deploy`, …); the vocabulary is lint-enforced.
---
@ -136,8 +148,11 @@ AI/LLM, a game server (Minecraft), generic static-site hosting. Plausible someda
none are committed.
**Confirmed exclusions (V4 had them; boma deliberately does not).** V4 mixed in a lot
of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, nvim/kitty/tmux,
LibreOffice, antivirus, remote desktop. boma is **server-only**, so these are correctly
absent. Likewise the removed Knowledge domain (Discourse, Snipe-IT, MRBS booking) and
V4-specific project websites — out of boma's scope by design. The narrower surface is
intentional, not an oversight.
of **workstation/desktop** config — XFCE/GNOME desktops, kiosk mode, LibreOffice,
antivirus, remote desktop. boma's **managed cluster/server hosts** stay server-only, so
these are correctly absent. (One scoped exception: the control / AI-worker host `ubongo`
runs an interactive `dev_env` — zsh/tmux/neovim — per ADR-015; that is the developer
environment of an infrastructure worker host, not a personal desktop, and does not apply
to managed service hosts.) Likewise the removed Knowledge domain (Discourse, Snipe-IT,
MRBS booking) and V4-specific project websites — out of boma's scope by design. The
narrower surface is intentional, not an oversight.


@ -1,13 +1,15 @@
# FRICTION.md — kaizen friction log
Raw signals for the periodic **kaizen review** (the methodology retrospective; see
`docs/TODO.md`). This is the input that keeps our tooling and conventions sharpening
over time instead of only accreting.
Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is
the input that keeps our tooling and conventions sharpening over time instead of only
accreting.
**How to use:** append freely _during_ work — don't curate, don't fix here. Capture
friction, surprises, fixes that keep recurring, and tooling that isn't earning its
keep. The kaizen review reads this, then proposes **add / change / remove** (biased
toward _remove_) and records the decisions as ADRs.
**How to use:** append freely _during_ work under **Open signals** — don't curate,
don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal
(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs,
and moves consumed signals into the **decisions ledger** below.
**Entry format:** `date — [tag] observation — (optional) → systematization idea`
Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
@ -16,47 +18,330 @@ earning its keep.
---
## 2026-05-30 — initial seed (from the Claude-Code setup session)
## Open signals
- `[recurring]` Every `git commit` needs `rbw` unlocked (the pre-commit ansible-lint
hook decrypts `vault.yml` for its syntax-check). Mitigated with a 5h lock timeout
and an `rbw unlocked` pre-flight convention. → _Open:_ could ansible-lint skip vault
decryption for syntax-check, so committing doesn't need the vault at all?
- `[gotcha]` pre-commit stashes _unstaged_ changes before running hooks, so a partial
commit reverted an interdependent file (`ansible.cfg`) and failed. → Commit
interdependent changes together, or stage the config change first.
- `[gotcha]` `make new-role` had never worked on this host: `mkdir {a,b,c}` brace
expansion fails under `/bin/sh` (dash). Fixed with explicit paths. → A real run
catches what static review can't; consider smoke-testing scaffold commands.
- `[gotcha]` `rbw sync` is required after adding a Vaultwarden item before `rbw get`
finds it (stale local cache).
- `[gotcha]` This shell is zsh — unquoted `$VAR` does not word-split, so a variable
holding a file list was passed as a single argument. → Use explicit args/arrays.
- `[friction]` Long sessions: I make a batch of edits but can't commit until you
`rbw unlock`. The 5h timeout + pre-flight check address the symptom; watch whether
it still bites.
- `[gotcha]` Hooks (or any new `.claude/settings.json`) added mid-session don't
activate until a Claude Code **restart** — the settings watcher only tracks settings
files that existed at session start. Opening `/hooks` and dismissing did _not_ load
them. → Fresh sessions load them normally; restart after adding hooks.
_(append new raw signals here; the next kaizen review consumes them)_
## 2026-05-31
- `[friction]` **Re-asked settled defaults (push + subagent-driven) at the plan→execute handoff**
(2026-06-19): despite the standing preference (memory `dont-reask-settled-defaults`: push to
origin as off-machine backup **and** go subagent-driven, both WITHOUT asking), I again asked the
operator "which execution approach?" and "want me to push?". The `writing-plans` skill scripts
that handoff question ("Which approach?"), and confirming a push felt natural — both overrode the
memory. → at the writing-plans → execution handoff, default to subagent-driven execution and push
to origin without a confirmation gate; reserve questions for genuine forks. Recurrence of an
already-recorded signal — treat the skill's scripted "Which approach?" as pre-answered
(subagent-driven) for this operator.
- I asked to draft an ADR and got: "No formal status-header convention, but since this is a
draft for discussion I'll mark it Proposed so it isn't mistaken for an accepted decision.
Here's the draft."
<!-- The six below are from the 2026-06-17 mesh-hardening-1/3 incident: applying base's
nftables default-deny + wt0-only sshd to askari (the off-site Docker host that ALSO runs
the NetBird coordinator) took it down on reboot; recovery needed the Hetzner console +
a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*. -->
## 2026-06-01
- `[gotcha]` **`base`'s nftables `forward policy drop` breaks Docker hosts on reboot**
(2026-06-17): `base/templates/nftables.conf.j2` sets `chain forward { ... policy drop; }`.
On a Docker host, container traffic is *forwarded* (published-port DNAT → container, and
inter-container over the bridge), so the drop kills it. It worked right after `make
deploy` (Docker's runtime rules coexisted) but after a reboot nftables loaded our
default-deny *before* Docker, breaking WAN→Caddy and Caddy→coordinator → the public
services and the mesh went down. The `docker_host` "`nftables.d` container-forward rules"
that would make this Docker-safe are explicitly **pending** (STATUS.md). → the `base`
firewall (`base__firewall_apply`) must NOT be applied to any Docker host until
`docker_host` ships the container-forward rules; add a guard/check (a Docker host with
`firewall_apply: true` and no container-forward drop-in is a misconfiguration), and the
firewall design (ADR-020) should state the Docker-host dependency explicitly.
- `[friction]` The `finishing-a-development-branch` flow (and generic AI/dev tooling)
offers "push and open a Pull Request," but our Forgejo `origin` is trunk-based with
no merge-request / approval gate (CLAUDE.md git conventions). That option doesn't
apply — the real path is local fast-forward merge to `main`, then push. → Skills and
conventions that assume a GitHub-style PR workflow need a homelab-aware variant;
encode that here "finishing a branch" means merge-locally-then-push, not open-a-PR.
- `[gotcha]` **`ip_nonlocal_bind` did NOT beat the sshd boot-race** (2026-06-17): the
mesh-hardening plan bound sshd `ListenAddress` to the `wt0` IP and set
`net.ipv4.ip_nonlocal_bind=1` so sshd could bind the mesh IP before `wt0` exists at
boot. In practice the console still showed sshd *"could not assign the address"* at boot
— so the protection did not work as designed, and because `wt0` never came up (the
coordinator was down), sshd had no listener at all → no SSH path. → the entire
"sshd listens on `wt0` only" premise is unsound without (a) a *verified* boot-race fix
and (b) a guaranteed non-mesh break-glass. Re-investigate why `ip_nonlocal_bind` didn't
help (ordering vs the sysctl drop-in load? the sysctl not applied before sshd start?),
or drop ListenAddress-on-mesh entirely and rely on the host firewall for SSH scoping.
## 2026-06-05
- `[gotcha]` **The coordinator host can't bootstrap the mesh it depends on** (2026-06-17):
`askari` runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird
agent needs the coordinator (a local container) to be serving to bring up `wt0` — but
the coordinator wasn't healthy, so `wt0` never came up. Circular. Combined with sshd
being `wt0`-only, the host was reachable only via the Hetzner console. → the
coordinator host must keep a **non-mesh management path always** (don't move its SSH onto
`wt0`), or the mesh-hardening must treat the coordinator host as a special case. General
rule: never make a host's only management path depend on a service that host itself
hosts.
- `[recurring]` The `writing-plans` skill ends by asking "subagent-driven vs inline
execution?" — always answer subagent-driven here. Don't ask; default straight to
subagent-driven (fresh subagent per task + review between tasks). → Standing
preference; skip the execution-mode prompt.
- `[gotcha]` **NetBird `netbird-server` FATAL-loops on the geolocation DB download with no
egress** (2026-06-17): on startup the combined `netbird-server:0.72.4` tries to download
the GeoLite2 DB from `pkgs.netbird.io` and treats failure as **FATAL** (crash-loop) — so
any loss of container egress (here: Docker NAT masquerade wiped when `nftables` was
flushed, not re-added by a plain `restart docker`) takes the whole control plane down.
Recovery was `restart docker` (rebuild NAT) → force-recreate the container so it could
download. → for the `netbird_coordinator` role: pre-seed/persist the geo DB in the data
dir (or pin a local copy), or disable the geolocation requirement, so a transient egress
blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT)
is fragile across `nft flush` + reboot.
- `[friction]` **No off-site coordinator backup turned a 2-minute restore into a long live
recovery** (2026-06-17): the NetBird coordinator's stateful store (`/var/lib/netbird`,
encrypted SQLite) has **no off-site backup yet** (ADR-022 `backup` role pending,
flagged in STATUS as the coordinator's deferred backup). During the incident there was a
real fear the unclean reboots had corrupted the store, with no restore path. It turned
out to be a runtime/egress issue, not corruption — but the absence of a backup made the
whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the
`netbird_coordinator` store ahead of the rest of the backup role; a recent off-host copy
would have made "rebuild askari from scratch" a safe option.
- `[friction]` **The plan tested reboot-recovery AFTER removing the break-glass**
(2026-06-17): the mesh-hardening plan's live cutover closed the WAN `:22` (step 5)
*before* the reboot-resilience test (step 7), so the one fallback path was gone exactly
when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for
lockout-risky cutovers: **validate reboot-recovery while the old access path is still
open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.
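The guard proposed in the `forward policy drop` gotcha in this group could be sketched roughly as follows. This is an illustrative Python check, not code from the repo: the `HostFacts` record and all of its field names (including `has_container_forward_dropin`) are hypothetical stand-ins for however the inventory exposes `base__firewall_apply` and the state of the `docker_host` drop-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostFacts:
    """Hypothetical per-host summary; the field names are assumptions."""
    name: str
    runs_docker: bool                   # the host runs Docker
    firewall_apply: bool                # mirrors base__firewall_apply
    has_container_forward_dropin: bool  # container-forward rules are present


def docker_firewall_misconfigs(hosts):
    """Return the names of Docker hosts that would receive base's
    default-deny forward chain without a container-forward drop-in."""
    return [
        h.name
        for h in hosts
        if h.runs_docker
        and h.firewall_apply
        and not h.has_container_forward_dropin
    ]
```

A pre-deploy check could refuse to continue whenever the returned list is non-empty, which would turn the 2026-06-17 failure mode into an error before the apply rather than an outage after the reboot.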
<!-- The below are from the 2026-06-18 ADR-025 build: standing up the local-VM integration
harness on ubongo and shaking it down against real KVM (spec/plan in docs/superpowers/). -->
- `[gotcha]` **Debian 13 genericcloud boot-loops under legacy BIOS/SeaBIOS** (2026-06-18):
`virt-install --import` of the genericcloud qcow2 with the default (SeaBIOS) firmware
triple-faults at the real-mode kernel handoff — GRUB loops, no "Decompressing Linux", no
DHCP lease. The symptom (no network) pointed away from the cause (firmware). → boot test
VMs via **UEFI** (`virt-install --boot uefi`; OVMF→efistub).
- `[friction]` **The no-sudo `claude` model blocked diagnosing a failed VM** (2026-06-18):
under ADR-015 `claude` had no sudo, so when the VM wouldn't network there was no way to
introspect it (serial logs are `root:0600`, libguestfs not installed, mounting needs
root). Diagnosis was fully blocked until the operator granted `claude` sudo. → DECISION:
`claude` gets `NOPASSWD:ALL` (reverses ADR-015's "no local sudo"); compensating control
is auditd/Loki attribution (already in ADR-015). Amend ADR-015/ADR-021 + accepted-risks;
codify the sudoers drop-in in Ansible.
- `[gotcha]` **Non-root `virsh`/`virt-install` default to `qemu:///session`** (2026-06-18):
the substrate (NAT net, /dev/kvm) lives on `qemu:///system`. → pin
`LIBVIRT_DEFAULT_URI=qemu:///system` in the driver.
- `[gotcha]` **`qemu:///system` (libvirt-qemu) can't traverse `/home`** (2026-06-18): VM
disk/seed/console under the repo/home failed "Permission denied (search permissions for
/home/claude)". → put per-VM artifacts in a system-readable dir (`/var/lib/boma-integration`,
group libvirt); the inventory (read by ansible as the user) can stay in the repo.
- `[gotcha]` **`ansible-playbook -i <dir>/` parses sibling non-inventory files as INI**
(2026-06-18): pointing `-i` at a run-dir holding a state file + qcow2s made the directory
inventory loader parse the state file as INI → phantom hosts INCLUDING the real `askari`
(with its real vars), breaking the single-host isolation invariant. → point `-i` at the
single `hosts.yml`. Caught by the holistic cross-file review BEFORE any hardware run.
- `[gotcha]` **Jinja `{%- -%}` + ansible `trim_blocks=True` double-strip newlines**
(2026-06-18): a template edit used `{%- -%}`, reviewed by rendering with RAW jinja2
(trim_blocks=False) which looked fine; ansible (trim_blocks=True) then collapsed the
rendered Caddyfile onto single lines → caddy crash-looped on invalid config. → verify
templates with ansible's whitespace (trim_blocks=True), not raw jinja2; prefer plain
`{% %}` at column 0 (the repo's existing style).
- `[gotcha]` **Fresh cloud images have empty apt lists** (2026-06-18): `apt install
nftables` failed "No package matching 'nftables' is available" on a fresh genericcloud
VM whose cloud-init had `package_update: false`. → `package_update: true` AND block on
`cloud-init status --wait` before applying.
- `[gotcha]` **base's default-deny firewall drops SSH to a NAT'd VM unless the gateway is
allowed** (2026-06-18): the driver reaches the VM via the libvirt-NAT gateway
(192.168.150.1). `ct established,related accept` saves the in-flight apply connection,
but a fresh post-reboot SSH is dropped without an explicit allow. → test overlay sets
`base__firewall_control_addr` to the NAT gateway.
- `[recurring]` **Real-hardware shakedown and static review each caught what the other
couldn't** (2026-06-18): the qemu-URI, storage-path, UEFI, apt-list, and caddy-render
bugs ALL surfaced only on a live KVM run; the phantom-host inventory bug surfaced only in
the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial
cross-file review AND a real-hardware run; neither alone would have shipped it working.
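The phantom-host gotcha in this group (a directory passed to `-i` being parsed file by file) suggests a small guard in whatever builds the `ansible-playbook` command line. The sketch below is an assumption about how such a guard might look; `single_inventory_args` is a hypothetical helper name and is not taken from `scripts/integration-vm.py`.

```python
from pathlib import Path


def single_inventory_args(inventory):
    """Build the `-i` arguments for ansible-playbook, refusing directories
    so sibling files (state files, disk images) are never handed to the
    directory inventory loader."""
    path = Path(inventory)
    if path.is_dir():
        raise ValueError(f"{path} is a directory; pass the inventory file itself")
    if not path.is_file():
        raise FileNotFoundError(f"inventory file not found: {path}")
    return ["-i", str(path)]
```

Rejecting directories outright is deliberately stricter than Ansible itself: it trades the convenience of directory inventories for the single-host isolation invariant the harness depends on.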
<!-- From the 2026-06-19 mesh-hardening-2/3 design (ubongo INPUT-only default-deny). -->
- `[friction]` **Raw DHCP leases pinned in ubongo's host firewall (admin-addr SSH allows)**
(2026-06-19): mesh-hardening 2/3 lets the operator workstations reach ubongo's LAN SSH by
*raw lease* (`base__firewall_admin_addrs: ["10.20.10.50" (mamba), "10.20.10.17"]`) because
there is no DHCP reservation yet (OPNsense isn't managed as code). A lease reassignment
silently moves the allow to whatever host next holds the IP (still SSH-key-gated) and drops
the workstation's *LAN* path (mesh still works, so never a full lockout). → when
OPNsense-as-code lands (ADR-020 perimeter / TODO 3.5), replace both with **MAC-pinned DHCP
reservations** (`10.20.10.17` = MAC `bc:0f:f3:c8:4a:8a`; mamba's MAC TBD) and allow the
reserved IPs. Spec: `docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md`.
- `[gotcha]` **`make test-integration` on ubongo fails (`qemu-img` "Permission denied") when
the agent session predates the `libvirt` group grant** (2026-06-19): the `integration_test`
role adds `claude` to `libvirt`+`kvm` and makes the cache dir `/var/lib/boma-integration`
`root:libvirt 2775` — correct — but a `claude` session whose shell started *before* that
grant carries a stale process group set (`id` → `claude,docker` only, no `libvirt`), so
`qemu-img create` of the VM overlay into the group-owned dir is denied. `virsh`/`virt-install`
still work (they reach system libvirtd via polkit/socket, and the real KVM runs server-side
as `libvirt-qemu`), so ONLY claude's own file-writes break. Unblock without restarting the
session: **`sg libvirt -c 'make test-integration HOST=<name>'`** (claude needs only `libvirt`
for the dir; `kvm` is server-side; note `sg` adds one group, not the full set). → self-heal
in `scripts/integration-vm.py`: if the `libvirt` gid is absent from `os.getgroups()`, re-exec
under `sg libvirt` (or have the Makefile target do it), so a stale-session agent never hits
this opaque symptom. New agent sessions pick the groups up on login, so it's a stale-session
transient — but high-confusion, worth self-healing.
- `[friction]` **No standard for when the agent may run local-VM integration tests on ubongo
without asking** (2026-06-19): `make test-integration HOST=<name>` spins an ISOLATED throwaway
KVM VM (its own libvirt NAT; never touches the real host's firewall/network; guards:
one-VM-at-a-time + a 4 GiB free-RAM floor + auto-destroy on success), so it is safe and
self-contained — yet the agent paused for a go-ahead before running it (mesh-hardening 2/3,
Task 4). The operator wants a STANDARD that pre-authorises VM-testing on ubongo so the agent
just runs it. → decide + record the rule: e.g. a `.claude/settings.json` permission allow for
`make test-integration*` / `scripts/integration-vm.py` (and the `sg libvirt -c '…'` form per
the gotcha above), plus a CLAUDE.md line distinguishing the pre-authorised isolated VM tests
from the genuinely-gated live steps (`make deploy` to real hosts, host reboots, cutovers —
still need a go-ahead). Ties to the `test-risky-infra-before-live-deploy` +
`dont-reask-settled-defaults` memories + ADR-025.
- `[gotcha]` **Molecule covers only the `input_only`-OFF (forward drop) branch of the base
firewall** (2026-06-19): mesh-hardening 2/3 added `base__firewall_input_only` (forward policy
drop↔accept). The `default` Molecule scenario renders ONE fixture, set to the secure default
(drop) — so the fast `make test ROLE=base` gate locks the drop default (security-critical for
service hosts) but does NOT exercise the `=true` → forward-`accept` rendering; only `make
test-integration HOST=ubongo` does (passed GREEN). An in-converge re-render can't cheaply
cover it (role defaults aren't in scope outside the role run). → decide in kaizen: a second
Molecule scenario (`molecule/input-only/`) asserting forward `policy accept`, vs accepting the
integration-only coverage. Final-review finding; not a cutover blocker (the accept branch is a
literal, and a var-name break would fail the drop branch too → caught).
- `[gotcha]` **Applying base's firewall to a Docker host flushes Docker's nat → container
egress dies until `restart docker`** (2026-06-19, mesh-hardening 2/3 live cutover): base's
`nftables.conf.j2` starts with `flush ruleset`, which wipes ALL tables incl. Docker's
`ip nat`/`ip filter` (+ libvirt's). On ubongo I chose INPUT-only so `forward` stays `accept`
— yet the apply STILL broke CONTAINER egress: `docker pull` worked (dockerd uses HOST egress)
but a container `ping` FAILED — the masquerade (SNAT) was gone, so replies couldn't return.
`forward accept` permits forwarding but can't replace the missing nat. The spec's "input-only
keeps Docker egress working" was therefore **incomplete**, and the local-VM harness couldn't
catch it (the test VM runs no Docker). Fix on the live host: `systemctl restart docker`
re-adds its `ip nat`/`ip filter` (egress restored; coexists fine with base's `inet filter`).
On REBOOT it self-heals (dockerd re-adds nat on boot; `forward accept` doesn't block — unlike
the 2026-06-17 `forward drop` incident). → (1) any cutover/runbook applying base firewall to a
Docker host MUST `restart docker` + check container egress after the apply; (2) the pending
`docker_host` nftables integration should own re-adding/persisting Docker's rules so base's
`flush` is safe; (3) the firewall final-review checklist should include "does the host run
Docker/libvirt? the flush wipes their nat."
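The self-heal idea from the stale-`libvirt`-group gotcha in this group could look roughly like the sketch below. It is a hedged illustration, not the actual contents of `scripts/integration-vm.py`: the function names are hypothetical, and it assumes the user is already a member of the group in the group database, so that `sg` does not prompt for a password.

```python
import grp
import os
import shlex
import sys


def has_group(gid, effective_gid, supplementary_gids):
    """True if the process already runs with `gid`, either as its
    effective group or as one of its supplementary groups."""
    return gid == effective_gid or gid in supplementary_gids


def sg_argv(group, argv):
    """Build the argv that re-runs `argv` as `sg <group> -c '<command>'`."""
    return ["sg", group, "-c", shlex.join(argv)]


def reexec_under_group_if_stale(group="libvirt"):
    """Re-exec the current script under `sg` when the session predates
    the group grant; return normally when nothing needs healing."""
    try:
        gid = grp.getgrnam(group).gr_gid
    except KeyError:
        return  # the group does not exist on this host
    if has_group(gid, os.getegid(), os.getgroups()):
        return
    # Replaces the current process. sg makes `group` the effective group,
    # so the check above passes on the second run and cannot loop.
    os.execvp("sg", sg_argv(group, [sys.executable, *sys.argv]))
```

Calling `reexec_under_group_if_stale()` as the first statement of the script's entry point would keep the behaviour identical for fresh sessions, since the function returns immediately when the group is already present.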
<!-- From the 2026-06-19 mesh-hardening 3/3 (askari INPUT-only integration gate). -->
- `[gotcha]` **`inet filter` default-deny blocks libvirt dnsmasq DHCP — silent, hard to diagnose**
(2026-06-19, task-3 integration gate): when `base__firewall_input_only: true` is applied to
ubongo, the `table inet filter { chain input { policy drop; } }` blocks DHCP packets that arrive
via the libvirt bridge (`virbr-boma`). In nftables, multiple tables at the same hook priority all
run independently; an `accept` verdict in `table ip filter LIBVIRT_INP` does NOT prevent
`table inet filter` from seeing and dropping the same packet. VMs never got DHCP leases (dnsmasq
socket confirmed by strace to never receive POLLIN despite tcpdump seeing the packet on
`virbr-boma`). Diagnosed by temporarily changing `inet filter input` to `policy accept` → fd=3
immediately fired. Fix: `/etc/nftables.d/10-libvirt-boma.nft` drop-in adding
`iifname "virbr-boma" accept` (survives service restarts via `include "/etc/nftables.d/*.nft"`).
→ The `base` role's template needs a `base__firewall_trusted_bridges` variable so this is
encoded at the Ansible level, not in a manual host drop-in. Every host that runs Docker or
libvirt and also has `base__firewall_input_only: true` needs an analogous exception.
- `[gotcha]` **libvirt `leaseshelper` PID-file permission: `virPidFileReleasePath` unlinks
`/run/leaseshelper.pid` after EVERY call; nobody cannot recreate it** (2026-06-19, task-3
integration gate): dnsmasq runs as nobody; `libvirt_leaseshelper` is its `--dhcp-script`. The
helper acquires a PID-file mutex at `/run/leaseshelper.pid`, but `virPidFileReleasePath`
UNLINKS the file on exit. `/run/` is `root:root 755`, so nobody cannot create the file after the
first unlink → every subsequent `add` call fails with `errno=13`, dnsmasq silently drops the
DHCP grant (no log, no error to the client). Fix: suid root C wrapper at
`/usr/lib/libvirt/libvirt_leaseshelper` (original moved to `.real`) that pre-creates
`/run/leaseshelper.pid` owned by nobody, then drops privileges and execs the real helper. The
root dnsmasq fork calls the wrapper; suid gives it permission to touch `/run/`; on return to
nobody uid the PID file stays. Also: `/var/lib/libvirt/dnsmasq/` must be `nobody:nogroup 775`
so leaseshelper can update `virbr-boma.status`. This fix is host-local on ubongo and NOT in
Ansible — encode it in an `integration_test` role task (or a libvirt role) before the harness
can be safely re-deployed.
- `[gotcha]` **cloud-init rejects underscores in `local-hostname` → silently skips
network-config → VM never gets DHCP** (2026-06-19, task-3 integration gate): setting
`local-hostname: boma-it-askari_inputonly-<uuid>` caused cloud-init-local to consider the
hostname invalid and skip writing the network-config to the system. Systemd-networkd then
used the genericcloud default (no DHCP), so VMs got only IPv6 link-local. Fix in
`scripts/integration-vm.py`: `name.replace("_", "-")` in the meta-data hostname (disk paths
and virsh domain names keep the original underscore). Sanitization rule: RFC-952 hostnames
allow hyphens, not underscores.
- `[friction]` **Molecule Docker image can't `apt install` → roles with real package tasks
have no Molecule substrate coverage** (2026-06-19): the Docker Molecule image ships with
cleared apt-lists and no internet access, so any role whose core work is `apt install`
(`base`, `docker_host`, `integration_test`) cannot cover its package/substrate tasks in
Molecule. Those tasks are validated only by `make test-integration` (ADR-025, real KVM).
The gap is systemic: it affects every role with non-trivial package or system-level setup.
→ systematization idea: provide a Molecule image or driver that can install packages (e.g.
a custom Docker image with pre-seeded apt-lists, or a `prepare.yml` that pre-installs
packages from a local cache), or an alternative driver (e.g. `molecule-libvirt` using the
same KVM harness), so substrate tasks get real Molecule unit coverage rather than relying
entirely on the integration harness.
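The hostname fix from the cloud-init gotcha in this group (`name.replace("_", "-")`) can be paired with a validity check, so that a name that is still invalid fails loudly instead of silently losing DHCP. The sketch below is an assumption about how that might be written; `cloud_init_hostname` is a hypothetical helper name, and the label rule used (letters, digits and hyphens, no leading or trailing hyphen, at most 63 characters) is the common DNS-label convention rather than something stated in the log.

```python
import re

# One hostname label: letters, digits and hyphens, no leading or
# trailing hyphen, at most 63 characters.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def cloud_init_hostname(vm_name):
    """Derive a `local-hostname` value from a VM name. Underscores are
    replaced with hyphens; anything still invalid raises ValueError."""
    candidate = vm_name.replace("_", "-")
    if not _LABEL.fullmatch(candidate):
        raise ValueError(f"not a valid hostname label: {candidate!r}")
    return candidate
```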
---
## Kaizen reviews — decisions ledger
Consumed signals and where their resolution now lives. Newest first.
### 2026-06-17
Second `/kaizen` run. 7 signals triaged; all 7 consumed (0 kept open). Two heavier items
(the `rename-incomplete` scan check and the Forgejo registry-login path) were built by
parallel subagents and verified against the diff. **Bias-to-remove note:** one PARK
(the ubongo self-management gap — out-of-phase, already tracked in STATUS) and zero
REMOVE; the rest accreted (migrate/change). None of the open signals were `[unused]`
*tooling*, so there was nothing to delete — the only reductive move available was parking
the out-of-phase build. **Cadence:** healthy — 3 days after the first run, every signal
0–2 days old except the one carried over from 2026-06-14; the "recurring ≥3" nudge in
`scripts/friction-scan.py` didn't fire this pass (all recurrence counts were 1), so the
thresholds need no change.
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| ADRs claim cross-doc reconciliation they didn't perform (06-14) | SYSTEMATIZE | New `rename-incomplete` check in `scripts/repo-scan.py` (+7 tests): when a numbered ADR announces a rename `Old` → `New`, flag any design-doc line where `Old` still appears in present tense (skips the announcing ADR, lines also naming `New`, and historical/negation cues; rejects `ADR-NNN` tokens as terms). 0 findings on the current tree: the Traefik→Caddy ripple edits have landed. Structural cousin of `stale-deferred`; run by `/review-repo`. (Was KEEP-OPEN on 2026-06-14; now built.) |
| Image push to the Forgejo registry needs an interactive `docker login` (06-15) | SYSTEMATIZE → vault | Vault-backed login path so pushes are agent-completable: `vault.forgejo.registry_token` stub (CHANGEME, operator-minted) + `scripts/registry-login.sh` (reads the token, `docker login --password-stdin`, never echoes it) + `make registry-login` + a prereq note in `docs/runbooks/claude-code-setup.md`. Works once the operator fills the token via `make edit-vault`. |
| Single-file bind mount + atomic rewrite = stale config (06-16) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Single-file bind mount + atomic rewrite = stale config (reload-in-place only)": `template` writes a new inode, a single-file bind mount pins the old one, so an in-container reload reads stale config. Mount the config *directory* for reload-in-place roles; restart-based roles are fine with a single-file mount. |
| `make check` always fails on the first-ever deploy of a compose service role (06-16) | CHANGE | `check_mode: false` on the `state: directory` scaffold tasks in `roles/reverse_proxy` + `roles/netbird_coordinator`, so the base dirs exist under `--check` and the rest of the dry-run (templates + compose) evaluates instead of failing on a missing `project_src`. Inert under converge → Molecule unchanged. |
| Re-asked settled defaults — push + execution mode, in prose (06-17) | CHANGE (exec) + ACCEPTED (push) | Widened `.claude/hooks/guard-execution-mode-menu.sh` to also catch free-form *prose* re-asks of the subagent-vs-inline choice (`"which execution approach?"`, `"subagent vs inline"`, …), not just the literal menu; tested. The push re-ask stays a soft default via the `dont-reask-settled-defaults` memory — a genuine "should I push?" is sometimes legitimate, so it is deliberately not hard-blocked. |
| Docs-only commit tripped the rbw-locked pre-commit guard (06-17) | CHANGE | Root cause was NOT the ansible-lint `files:` scope (innocent) — it was `.claude/hooks/guard-vault-preflight.sh` blocking *every* locked `git commit`. Rewrote it to inspect the staged set (`git diff --cached`, plus `-a`/`--all`) and block only when Ansible content (`^(roles\|playbooks\|inventories)/.*\.ya?ml$`) is staged; docs-/config-only commits are now exempt. Fail-safe to block when unsure. Tested. |
| Agent can't self-manage `ubongo` (the control node it runs on) without operator grants (06-17) | PARK | The knowledge already lives in `STATUS.md` (control-node row: the interim `claude`-key + `sjat` NOPASSWD grants, and **Pending:** the proper `ansible`-user bootstrap) and the `ubongo-self-sufficiency` memory. Out-of-phase — the fix is the control-node bootstrap recipe, a tracked future build. **Resurrection trigger:** when building ubongo's `base` hardening / `ansible`-user bootstrap, fold in key-trusted NOPASSWD self-management so control-node self-management needs no ad-hoc operator grants. |
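
The staged-set rule in the `guard-vault-preflight.sh` row above can be sketched as follows. This is a minimal Python illustration, not the hook itself (which is a shell script): the function name `needs_vault` and the use of `None` for "staged set unknown" are assumptions of this sketch; only the path pattern is taken from the ledger row.

```python
import re

# Pattern quoted in the ledger row: Ansible content under these three trees.
ANSIBLE_CONTENT = re.compile(r"^(roles|playbooks|inventories)/.*\.ya?ml$")


def needs_vault(staged_paths):
    """Return True when a commit should be blocked while the vault is locked.

    `staged_paths` is a list of staged file paths, or None when the staged
    set could not be determined (a convention invented for this sketch).
    """
    if staged_paths is None:
        return True  # fail-safe: block when unsure
    return any(ANSIBLE_CONTENT.match(path) for path in staged_paths)


if __name__ == "__main__":
    print(needs_vault(["docs/ROADMAP.md"]))            # docs-only commit
    print(needs_vault(["roles/base/tasks/main.yml"]))  # Ansible content staged
```

Because the pattern is anchored at the start of the path, a file such as `docs/roles/example.yml` does not count as Ansible content.
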
### 2026-06-14
First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
a `repo-scan.py` check is its own build). **Bias-to-remove note:** zero PARK/REMOVE — none
of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process,
which migrate or archive (knowledge is never deleted).
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → **RESOLVED 2026-06-15** | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral **closed** — root cause was **version skew** (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |
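
Why `item.values` misfires can be shown with a simplified model of Jinja's lookup order: dot access tries the Python attribute first and falls back to the key, while subscript access tries the key first. The two helper functions below are hypothetical stand-ins written for this sketch, not Jinja's real internals, and the sample `item` is invented.

```python
def jinja_dot(obj, name):
    """Approximate Jinja's `obj.name`: attribute first, then key."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


def jinja_subscript(obj, key):
    """Approximate Jinja's `obj['key']`: key first, then attribute."""
    try:
        return obj[key]
    except (KeyError, IndexError, TypeError):
        return getattr(obj, key)


# Invented loop item whose key collides with a dict method name.
item = {"name": "dns1", "values": ["10.0.0.1"]}

if __name__ == "__main__":
    print(callable(jinja_dot(item, "values")))  # the dict method, not the data
    print(jinja_subscript(item, "values"))      # the list stored under the key
```

Keys that do not collide with a dict method (such as `name` here) resolve the same way under both forms, which is why the bug only surfaces for keys like `values`, `items`, or `keys`.
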
### 2026-06-10
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a *mechanical* guard, not another note. |
| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was **not** the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so **docs-/config-only commits skip it and need no vault**. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't
built). The 2026-06-14 block was the **first run of `/kaizen`** itself
(`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both
cleared the backlog and validated the command.


@ -6,6 +6,15 @@ Project documentation.
Numbered from 001; each records context, the decision, and what was ruled out.
- `runbooks/` — step-by-step operational procedures (add a host, add a role, rotate
secrets).
- `security/` — security baseline, accepted-risk register, per-service checklist +
template (ADR-002/004).
- `testing/` — testing methodology artifacts + the `VERIFY.md` template (ADR-008/017).
- `access/` — operational-access doctrine + the `ACCESS.md` template (ADR-021).
- `backup/` — backup doctrine + the `BACKUP.md` template (ADR-022).
- `hardware/` — capacity reference + `/capacity-review` output (ADR-012).
- `reviews/`: `/review-repo` audit trail.
- `CAPABILITIES.md` / `ROADMAP.md` / `TODO.md` / `FRICTION.md` — what boma does, the
build order, the backlog, and recurring-friction notes.
For what is actually **built vs only designed**, see `STATUS.md` at the repo root —
the ADRs describe intent, not necessarily current reality.

227
docs/ROADMAP.md Normal file

@ -0,0 +1,227 @@
# ROADMAP — boma build order
High-level **build order** for the project. Almost everything in `docs/decisions/`
(the ADRs) is *designed, not built* — this file sequences that backlog into milestones
and records *why* the order is what it is.
- **What is built vs planned:** `STATUS.md` (ground truth — always check there first).
- **The backlog of decisions:** `docs/TODO.md` (this roadmap sequences it).
- **The design rationale:** `docs/decisions/` (ADRs).
This is a **living document**: update it as milestones land (move them to `STATUS.md`),
as ordering changes, or as new milestones appear. Each milestone gets its own
spec → plan → implementation cycle (`docs/superpowers/specs/` then `…/plans/`) when it
comes up; this file stays high-level.
_Last updated: 2026-06-19._
---
## Strategy — "remote-access first" (Approach A)
One focused track now (**Off-site / Remote-access**), a **procurement gate**, then the
**Cluster** track. Cross-cutting/ongoing work runs underneath both.
**Why this order.** The only physical machine that exists today is `ubongo` (the control
node); the Proxmox cluster is a procurement decision, not yet made. The nearest-term goal
— reach `ubongo` from `mamba` / a work laptop while on the move — needs only things
already available or cheap to spin up (`askari` at Hetzner, the laptops). Doing the
remote-access track first:
1. **delivers the mobile-access goal in the first phase**, and
2. **doubles as the proving ground** for boma's core machinery — the first real *service
role* (NetBird), the `base` role on a *real, internet-facing* host, the `offsite_hosts`
pattern, public DNS + ACME, the backup contract, and `rbw`/vault in anger — all on two
cheap, low-stakes hosts **before** spending on the cluster.
Cluster hardware is then procured *after* those patterns are proven and a
`/capacity-review` informs the sizing — so the spend happens once, with knowledge.
Rejected alternatives: **B — procure now, build strictly bottom-up** (mobile access lands
late; spend precedes any proven pattern). **C — two parallel tracks** (for a solo operator
this collapses into interleaving with extra context-switching cost).
---
## Phase 1 — Off-site / Remote-access — ✅ COMPLETE (2026-06-17)
Delivers mobile access to `ubongo`; proves the machinery. Ordered by *real* dependencies.
All milestones (M1–M5) done; the mobile-access goal is met. Next: the Procurement gate.
### M1 · boma's DNS home — a new domain at Gandi, managed as code
Register a **new Swahili-themed domain at Gandi** for boma and manage its records **as
code (IaC)**. Greenfield, not a migration: investigating the existing domains ruled them
out as boma's home — `baobab.band` is the **live legacy homelab** (Cloudflare; vaultwarden
/ nextcloud / matrix in daily use), and `ziethen.dk` is the **family's primary email**
(Fastmail); moving either's authoritative DNS risks breaking production. A fresh domain is
zero-risk and *born at Gandi*.
- **Driver:** values/sovereignty (Gandi) + a clean, decoupled home so boma builds without
endangering anything live. `baobab.band`'s Cloudflare exit / V4 decommission is a
**separate, later track**, not part of this build. `ziethen.dk` is untouched.
- **IaC approach:** follow boma's grain — internal DNS is already Ansible-rendered and
Terraform owns *no* DNS (CLAUDE.md), so **public DNS is Ansible-managed too** (Gandi
LiveDNS via an Ansible module — exact module pinned in M1's spec, verified per ADR-014).
- **Naming scheme (decided):** three tiers (on boma's new domain, `<boma-domain>`) —
`<host>.boma.<boma-domain>` (infra, internal-only) · `<service>.<boma-domain>`
(home/cluster services, split-horizon) · `<service>.askari.<boma-domain>` (off-site/VPS,
public). **`nyumbani` dropped.** Home services are **mesh/LAN-only by default** (no
public record; reached over LAN or the NetBird mesh), with public Gandi records only for
deliberate exceptions. The NetBird mesh carries the `<boma-domain>` match-domain to
road-warriors (resolver = dns1/dns2 over `wt0`); a `*.<boma-domain>` ACME **DNS-01**
wildcard cert (Gandi API) gives even unexposed services real TLS. Resolves TODO 4 and
review finding O12.
- **Records as a new/updated ADR:** amends ADR-007 — boma's public zone is
`<boma-domain>` at Gandi LiveDNS managed as code; the three-tier naming scheme;
`nyumbani` removed; mesh/LAN-only default; `baobab.band` (legacy, Cloudflare) is out of
scope.
- **Maps to:** ADR-007 (network/DNS), ADR-016 (mesh DNS), TODO 4 (**resolved here**).
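
The three-tier naming scheme above can be sketched as a small helper. This is an illustration only: the function name `fqdn` and the tier labels `infra` / `service` / `offsite` are assumptions of this sketch, and the real records are rendered by Ansible rather than by a Python function.

```python
def fqdn(name, tier, domain):
    """Build a name under the three-tier scheme described in M1.

    tier is one of "infra", "service", "offsite" (labels invented here).
    """
    patterns = {
        "infra": "{name}.boma.{domain}",      # <host>.boma.<boma-domain>, internal-only
        "service": "{name}.{domain}",         # <service>.<boma-domain>, split-horizon
        "offsite": "{name}.askari.{domain}",  # <service>.askari.<boma-domain>, public
    }
    if tier not in patterns:
        raise ValueError(f"unknown tier: {tier!r}")
    return patterns[tier].format(name=name, domain=domain)


if __name__ == "__main__":
    print(fqdn("ubongo", "infra", "wingu.me"))
    print(fqdn("netbird", "offsite", "wingu.me"))
```
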
### M2 · `askari` provisioned + under Ansible
Provision the Hetzner VPS **as IaC with Terraform** (Helsinki / Debian 13, behind a
TF-managed Hetzner Cloud Firewall), bring it into `offsite_hosts`, and bootstrap it.
**Shipped as cx23/x86** (CAX11/ARM was out of stock EU-wide on 2026-06-14 — same-spec
x86, cheaper). Design: `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`.
- **Decided:** Terraform owns `askari`'s existence — generalizes ADR-006 from "Proxmox VM
existence" to **Proxmox + Hetzner** (new `hetznercloud/hcloud` provider, `hetzner_vm`
module, `offsite` stack). Token via `TF_VAR_hcloud_token` from `vault.hetzner.token`.
- **Proves:** the `offsite_hosts` pattern, the TF→Ansible handoff for a non-Proxmox host
(`tf_to_inventory.py` extended), bootstrap of a non-cluster host. Closes review finding
O6 (`offsite_hosts` missing from `hosts.yml`).
- **Amends:** ADR-006 (TF scope), ADR-009 (offsite handoff), ADR-020 (Hetzner Cloud
Firewall = perimeter), ADR-007/016 (`askari` TF-provisioned, not "added manually").
### M3 · `base` matured to a "remote-access-sufficient" subset — ✅ DONE
Added the `hardening` concern to `base` (sshd drop-in key-only + `PermitRootLogin no`;
fail2ban sshd jail 5/1h; ADR-002) and **applied it to askari** by tag
(`make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening`) — SSH still works, fail2ban
active. Full CIS L1/L2, auditd, AppArmor, AIDE remain deferred to Phase 2 (TODO 15).
- **NetBird agent → M4** (deferred from M3: it enrolls against the coordinator, which
doesn't exist until M4 — ADR-016's coordinator-first bootstrap order).
- **Host firewall on askari + ubongo hardening → M5** (applying default-deny pre-mesh
would lock out SSH; the Hetzner Cloud Firewall is askari's perimeter until then).
- **Spec/plan:** `docs/superpowers/{specs,plans}/2026-06-14-base-ssh-fail2ban-m3*`.
- **Maps to:** ADR-002 (security baseline), ADR-020 (firewall — built, not yet applied),
TODO 15 (the rest of hardening → Phase 2).
### M4 · NetBird control plane on `askari` — first real service role
Built in two phases. **M4a (platform) — ✅ DONE:** Docker on askari + boma's standard
**Caddy** reverse proxy (ADR-024), proven by `https://test.askari.wingu.me` serving a
valid Let's Encrypt cert (HTTP-01; the Gandi **DNS-01** path is now built + proven —
2026-06-15, see ADR-024 — for mesh/LAN-only cluster services).
Firewall opened 80/443/3478. Spec/plan: `…2026-06-14-netbird-coordinator-m4-design.md` /
`…2026-06-14-m4a-docker-caddy.md` / `…2026-06-14-m4b-netbird.md`.
**M4b — ✅ DONE (2026-06-16):** the `netbird_coordinator` service role, deployed to askari.
Reality differed from the original plan (captured fresh per ADR-014): NetBird **v0.72.4**
ships a **single combined `netbird-server`** container (management + signal + relay + STUN
+ **embedded Dex** IdP at `/oauth2`) plus `dashboard:v2.39.0` — **no separate signal/relay
container and no Coturn**. Fronted by the M4a Caddy via gRPC-h2c + WebSocket + path routing.
Dashboard live at `https://netbird.askari.wingu.me` (valid LE cert); `/api` auth-gated.
**M5 (enrol peers) is next** — incl. the first-boot `/setup` admin + setup keys.
- **First exercise of:** the service-role conventions (`SECURITY.md` / `VERIFY.md` /
`ACCESS.md` / `BACKUP.md`), public **TLS / ACME**, and the **backup contract**
NetBird's management datastore is *stateful*, so it gets encrypted off-host backup
(ADR-016 §recovery, ADR-022).
- **Open design choice (decide in M4's spec):** a minimal ACME-terminating reverse proxy
(e.g. Caddy) just for NetBird on `askari`, vs leaning on NetBird's bundled setup.
- **Maps to:** ADR-016 (mesh), ADR-004 (one service = one role), ADR-021 (access),
ADR-022 (backup), ADR-008/017 (VERIFY), accepted-risk R3 (askari public surface).
### M5 · Enroll peers → goal reached — ✅ DONE (2026-06-17)
The `base` `mesh` concern enrolled **`ubongo` (`100.99.146.14`) + `askari`
(`100.99.226.39`)** as NetBird peers — both Management+Signal Connected, the ubongo↔askari
mesh link ping-verified. NetBird ships a default **Allow-All** peer policy, so any enrolled
peer reaches `ubongo` over `wt0`. The road-warrior clients (**`mamba` + the work laptop**)
are enrolled (operator, via `docs/runbooks/netbird-client.md`) → **`ubongo` is reachable
from anywhere. ← the mobile-access goal is met; Phase 1 is complete.**
- **Deferred to a "mesh-hardening" follow-on** (was folded into M5; split out as the
lockout-risky part): apply `base` nftables **default-deny** to `ubongo` + set
`base__firewall_control_addr` (ADR-021 `ssh-from-control`, built/dormant); tighten the
NetBird ACL off Allow-All to scoped policies; move `askari`'s SSH onto `wt0` (retiring
the Hetzner-firewall WAN allow). Safe to do now that the `wt0` path exists.
- **Maps to:** ADR-016, ADR-021 (SSH ladder: `wt0` + ssh-from-control), ADR-020.
---
## Gate — Procurement decision
Run `/capacity-review` (intent-based) to size the cluster, **then procure the Proxmox
hardware**. Every core pattern (service role, base-on-real-host, DNS+ACME, backup, access)
has by now been rehearsed on two cheap hosts, so the spend happens once and informed.
- **Maps to:** ADR-012 (hardware & capacity), `/capacity-review`.
---
## Phase 2 — Cluster (gated on procurement; coarse until M5 is near)
Canonical dependency order:
1. **Terraform provisioning**: `terraform init`/apply the Proxmox VM module; regenerate
inventory via `make tf-inventory` (ADR-006, ADR-009).
2. **`base` full** — CIS L1/L2, auditd, AppArmor (enforce), AIDE, packages, users; the
VM disk layout for CIS L2 is decided **before** provisioning (ADR-002, TODO 15).
3. **`docker_host`** — real Docker engine + Compose, daemon hardening, `nftables.d`
container rules (currently a scaffold; ADR-004, ADR-020).
4. **`dns` role** — render the internal zone from inventory (ADR-007).
5. **Auth + reverse proxy** — Authentik + **Caddy** (ADR-024): the foundation every
service sits behind with authentication (ADR-002).
6. **Monitoring** — Loki + Grafana Alloy (logging, ADR-018) + Prometheus/exporters +
Uptime Kuma; decide which alerts live where (TODO 3.6).
7. **Service roles** — PhotoPrism, email, indexers, … (`docs/CAPABILITIES.md`); each
clears `docs/security/service-checklist.md` and carries its standard files.
8. **`backup` role + `fisi` pull node** — restic Model A, pCloud + USB air-gap (ADR-022).
9. **Forgejo Actions CI** — runner + workflows (ADR-003/010, TODO 1).
---
## Underneath both — Cross-cutting / ongoing
- **Accept ADR-011** (update management) — resolve its 6 open questions before the first
scheduled patch run (TODO 16).
- **Kaizen `/retro`** + keep appending to `docs/FRICTION.md` (TODO 11); **`/security-review`**
skill (TODO 8.5); **`/review-repo` fortnightly cron** + headless email (TODO 8.1);
`scheduled_jobs` role (TODO 8.3).
- **User-notification function** — ntfy / matrix / email so tools + AI can reach the
operator (TODO 9; ties to ADR-011 control channel).
### Parked decisions — decide when they bite, not before
- Split-horizon FQDN with or without `nyumbani` (TODO 4) — likely settled in M1.
- Central database server vs per-app databases (TODO 3.9) — at the service phase.
- Script-dependencies policy: stdlib-only vs selective libraries (TODO 14).
- Keep the custom Molecule base-image method as testing matures (TODO 3.10).
---
## Next step
**Phase 1 complete (M1–M5); mesh-hardening: ubongo (2/3) DONE 2026-06-19, askari redesign DONE 2026-06-20.**
Both hosts now run INPUT-only nftables default-deny (`base__firewall_input_only`), live reboot-validated.
askari's redesign (spec/plan `docs/superpowers/{specs,plans}/2026-06-19-mesh-hardening-askari-redesign*`)
applied INPUT-only default-deny + `wt0`-primary SSH + a permanent WAN break-glass + a geo-disabled
coordinator; a real reboot recovered unattended. Remaining mesh-hardening sub-projects:
1. ~~`ubongo` nftables default-deny + `ssh-from-control`~~ → **DONE (2026-06-19).**
2. ~~**redesign** `askari`'s SSH → `wt0`~~ → **DONE (2026-06-20)** — boot-race, coordinator-bootstrap
chicken-egg, and Docker-nat-flush all resolved + live reboot-validated.
3. ~~**askari relay-SPOF reduction**~~ → **DONE (2026-06-20)** — assessed + **accepted** as a
documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).
**Then** the Procurement gate (`/capacity-review` → buy Proxmox hardware) opens Phase 2.
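
The coordinator-FQDN pin in item 3 amounts to an idempotent, anchored edit of `/etc/hosts`. The sketch below shows that idea in Python under stated assumptions: it is not the `base` role's implementation (which is Ansible), the function name `pin_host` is hypothetical, and the address and FQDN in the usage lines are placeholder values.

```python
import re


def pin_host(hosts_text, ip, fqdn):
    """Return hosts-file text containing exactly one `ip fqdn` line for fqdn."""
    # Anchored on both ends: only whole lines mapping an address to exactly
    # this FQDN match, so a longer name sharing the suffix is left alone.
    line_re = re.compile(r"^\S+[ \t]+" + re.escape(fqdn) + r"[ \t]*$")
    pinned = f"{ip} {fqdn}"
    out, seen = [], False
    for line in hosts_text.splitlines():
        if line_re.match(line):
            if not seen:
                out.append(pinned)  # rewrite the first pin in place
                seen = True
            continue  # drop any duplicate or stale pin
        out.append(line)
    if not seen:
        out.append(pinned)  # no existing pin: append one
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Placeholder values: a documentation-range address and an example FQDN.
    before = "127.0.0.1 localhost\n"
    print(pin_host(before, "203.0.113.10", "coordinator.example.org"), end="")
```

Running the function twice with the same arguments returns the same text, which is the property that makes such a pin safe to re-apply on every converge.
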


@ -1,47 +1,53 @@
# ToDo
> **Build order lives in `docs/ROADMAP.md`** — that sequences this backlog into
> milestones. This file is the decision backlog; the roadmap is the order we build them.
>
> **Open items only.** Item numbers are stable cross-references (cited by ROADMAP,
> STATUS, ADRs, scripts) — **never renumber**. When an item is decided or built, collapse
> it to a one-line pointer in place; the full record lives in its ADR / `STATUS.md` / the
> `FRICTION.md` decisions ledger.
1. **Forgejo CI** — what CI work remains after ADR-010 (which workflows, runner
setup, etc. still need to be built)?
2. **Testing**
1. Choose and configure code-testing tooling (Molecule, etc.).
2. Decide how the AI interprets Molecule output and performs live testing:
API calls, curl pulls of web products, log reviews, and headless browsing.
3. Define a standard for generating test users and for instructing the user to
perform relevant manual tests.
2. Decide how the AI interprets Molecule output and performs live testing — API
calls, curl pulls of web products, log reviews. Headless browsing → ADR-017
(`/verify-service`); the API/curl/log-review siblings remain open.
3. ~~Standard for test users + manual-test instructions.~~ → ADR-017.
4. ~~Local VM integration testing on ubongo.~~ → ADR-025 / `make test-integration` (built + RED→GREEN validated 2026-06-18).
3. **Building services**
1. Decide how to manage logs.
2. Decide how to manage APIs / API access.
3. ~~Decide how to import or integrate from baobabAnsibleV4.~~ DECIDED (ADR-013):
translate-don't-transplant — V4 is a source only of gotchas + working config
snippets, re-derived on boma's terms; never structure/requirements/values.
1. ~~Decide how to manage logs.~~ → ADR-018.
2. ~~Decide how to manage APIs / API access.~~ → ADR-021.
3. ~~Decide how to import/integrate from baobabAnsibleV4.~~ → ADR-013.
4. Decide what each node runs — base packages plus which apps/services.
5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
Kuma alerts on askari.
7. Define a tagging standard that lets us target runs without over-tagging.
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
5. ~~Decide the firewall strategy.~~ → ADR-020 (builds: host nftables in `base` done; OPNsense-as-code pending).
6. Wire up the monitoring stack — Prometheus + metric exporters, Uptime Kuma, and
exactly which alerts live where. (Logging topology → ADR-018.)
7. ~~Define a tagging standard.~~ → ADR-019.
8. ~~Ensure the right things are backed up.~~ → ADR-022 (build: the `backup` role, Plans 2–3, pending).
9. Decide: a central database server, or individual database services per app?
10. Should we continue to use the base-container method, or maybe something in the improvements of the methods in boma moods the point?
10. Should we keep the custom base-container (Molecule test image) method for role
testing, or revisit it as boma's testing approach matures (ADR-008)?
11. ~~Deliberate tagging strategy.~~ → ADR-019 (folded into 3.7).
4. **Split-horizon FQDN** — adopt split-horizon FQDN with or without nyumbani?
4. ~~**Split-horizon FQDN.**~~ → ADR-007 / M1 (`wingu.me` three-tier; `nyumbani` dropped; mesh/LAN-only default).
5. **Control node**
1. Set up and test the control node while waiting for hardware.
2. Define control-node bootstrapping — a dedicated recipe and playbook?
3. Decide the role of mamba — access/availability vs compute power and ease?
4. Set up rbw on the control node.
3. Set up rbw on the control node.
6. **Updating**
1. Decide pinning vs latest for versions.
2. Decide the update strategy across services & containers vs packages &
builds / GitHub pulls / Flatpaks.
3. Define scheduling of updates and reboots, including post-update testing.
6. **Updating** — 1. Decide the update strategy across services & containers vs packages
& builds / GitHub pulls / Flatpaks. 2. Define scheduling of updates and reboots,
including post-update testing. (Tracked in item 16 / ADR-011.)
7. **Shell setup**
1. Decide what shell setup matters for the AI's work on the control node.
2. Decide what to set up on the hosts, given that direct access will be rare.
2. ~~Decide what to set up on the hosts (direct access rare).~~ → ADR-021.
8. **Scheduled work**
1. Run `/review-repo` as `claude -p` via cron every two weeks?
@ -67,45 +73,44 @@
accepted-risk register (`docs/security/accepted-risks.md`). Could pair a
deterministic pre-scan (undeclared open ports, disabled baseline controls,
world-readable secrets, services not behind auth) with a judgement pass.
Open question: standalone, or folded into the kaizen `/retro` (item 11)?
Open question: standalone, or folded into `/kaizen` (item 11)?
9. Should we make a basic function so that tools (and AI) can send messages to the user - email, matrix or ntfy?
10. **Claude setup** — DECIDED: brainstorm for intent, capture as ADRs (skip plan
files); hooks + slash commands + `/review-repo` for enforcement at scale. Any
remaining setup to carry out from this decision?
1. ~~Policy for how we collaborate with references to baobabAnsibleV4 without misusing it.~~ DECIDED — ADR-013.
2. Policy for how we write key documents like ADRs.
3. Further development on how we we collaborate on designing the foundation for the project - seperate from how we implement new containers etc.
4. ~~How do we make sure agents always use the latest official documentation for the technologies etc. we use?~~ DECIDED — ADR-014 (facts → version-matched docs, cited + stamped; best practices → translated per ADR-013; risk-based triggers; graceful fallback to WebFetch).
5. Always subagent driven?
10. **Claude setup** — DECIDED: brainstorm for intent → ADRs; hooks + slash commands +
`/review-repo` for enforcement at scale. Remaining:
1. ~~V4 collaboration policy.~~ → ADR-013.
2. ~~Policy for how we write key documents like ADRs.~~ → ADR-023.
3. Further development on how we collaborate on designing the foundation for the project - separate from how we implement new containers etc.
4. ~~Always-latest official documentation for our tech.~~ → ADR-014.
5. ~~Always subagent-driven?~~ → DECIDED: yes (standing agreement; enforced by `.claude/hooks/guard-execution-mode-menu.sh`).
6. When AI deploys, i.e. runs playbooks etc., should we make a methodology so that it does not have to poll all the time or review all the output? Perhaps something about the MAKE method could provide only the relevant feedback?
7. ~~Reproducible agent toolchain (surfaced by ADR-014).~~ DONE — repo
`.claude/settings.json` declares `extraKnownMarketplaces` + `enabledPlugins`
(active set: superpowers · context7 · terraform · claude-md-management) and a
conservative permissions allowlist; bootstrap procedure in
`docs/runbooks/claude-code-setup.md`. Deferred plugins listed there with
triggers. (Plugin install is still a per-machine `/plugin` action — no native
auto-install.)
7. ~~Reproducible agent toolchain.~~ → `.claude/settings.json` + `docs/runbooks/claude-code-setup.md`.
8. **Screenshot hand-off to the agent.** Give the operator a smooth way to hand the
agent a screenshot (e.g. of a Hetzner/VNC console during an incident) — the agent
can already read image files; the gap is the hand-off. During the 2026-06-17
incident the only diagnostic channel was console screenshots, copied manually to
`/tmp` and `find`-located. Options: a known drop path the agent checks (e.g.
`~/screenshots/`), a small `screenshot`/paste helper or slash-command, or a
clipboard→file convention. Cheap, high-value for incident work.
11. **Kaizen loop** — set up ~2026-06-06 (one week from now).
1. Build `/retro`: reads `docs/FRICTION.md` + recurring `/review-repo`
findings + a tooling-usage inventory; proposes add / change / **remove**
(biased to remove); records decisions as ADRs; evaluates itself.
Recurrence-triggered plus a light periodic sweep.
2. Keep appending raw signals to `docs/FRICTION.md` (live now) until the
retro consumes them.
11. **Kaizen loop**: `/kaizen` built (STATUS).
1. ~~Build the loop command.~~ → `/kaizen` (`scripts/friction-scan.py` + `.claude/commands/kaizen.md`; spec `docs/superpowers/specs/2026-06-14-kaizen-command-design.md`).
2. Keep appending raw signals to `docs/FRICTION.md` (ongoing practice; see FRICTION.md).
3. **Automation deferred** (revisit when the notify + cron stack is up): wire a
**scheduled headless** run — report-only (proposes verdicts + notifies, does not
auto-curate/commit). The on-demand command + recurrence/age nudge ship now.
12. **Spin-up order** — what is the right order of operations when spinning up
from scratch (OS, DNS, Authentik, Traefik, …)?
12. **Spin-up / build order** — what is the right order of operations when spinning up
from scratch (OS, DNS, Authentik, Caddy, …)?
13. **Intentions** - Is the current setup clearly identifying intentions throughout? We have the readme files, but is that enough? Also, how do we re-challenge decisions and how they interact over time? For example: we have two services running, but extending one a little could make the other redundant, so we could remove it. Or an alternative to a service has emerged, and it is actually better.
14. **Script dependencies policy** — utility scripts (`tf_to_inventory.py`,
`repo-scan.py`, `capacity-scan.py`) are stdlib-only by convention, for
run-anywhere portability (control node, CI, bare clone, no venv). Reevaluate
whether selectively allowing libraries (e.g. PyYAML — already present via
Ansible) is a better fit in general: weigh the parsing-correctness win
against losing zero-setup portability. Decide a clear rule and record it.
`repo-scan.py`, `capacity-scan.py`, `friction-scan.py`) are stdlib-only by
convention, for run-anywhere portability (control node, CI, bare clone, no venv).
Reevaluate whether selectively allowing libraries (e.g. PyYAML — already present via
Ansible) is a better fit in general: weigh the parsing-correctness win against losing
zero-setup portability. Decide a clear rule and record it.
15. **Security hardening implementation** — build out the ADR-002 hardening standard.
1. Implement the CIS Debian Benchmark **Level 1 + Level 2** in the `base` role
@ -123,6 +128,7 @@
6. Supply-chain hygiene: enforce tiered image pinning (stateful `tag@digest`;
stateless rolling tags — ADR-011) + official/verified images via the service
checklist; revisit active scanning (Trivy/Grype) once a triage stack exists (R1).
7. Is our network setup as it should be? I am not sure whether all traffic between ubongo and notes goes via askari. What if askari breaks: will the rest still work?
16. **ADR-011 (update management) — resolve open questions + accept.** Committed as
**Proposed**; resolve before marking Accepted:
@ -136,7 +142,4 @@
Friday timing enough at this scale?
6. Notification/control channel — boma's own ntfy topics (ADR-013) + a "skip this
week" / "pause" switch (ties to TODO 9).
7. ~~Reconcile pinning conflict (tags vs digests).~~ DECIDED: tiered rule —
**stateful `tag@digest`** (readable tag + integrity digest), **stateless
rolling tags**. Aligned across ADR-011 (dec. 2), ADR-004, ADR-002 supply-chain
row + accepted-risk R1, the service checklist, and 15.6.
7. ~~Reconcile pinning conflict (tags vs digests).~~ → DECIDED: tiered (stateful `tag@digest`, stateless rolling); ADR-011 dec. 2 / ADR-004 / ADR-002.
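
A minimal sketch of how the tiered pinning rule decided above could be checked. It assumes image references are plain strings and that each role declares whether it is stateful; `pin_problem` and its arguments are hypothetical names for illustration, not something in the repo.

```python
import re
from typing import Optional

# Digest suffix as it appears in references like name:tag@sha256:<64 hex chars>.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def pin_problem(image: str, stateful: bool) -> Optional[str]:
    """Return a description of the pinning violation, or None if the image is fine."""
    has_digest = bool(_DIGEST.search(image))
    name = _DIGEST.sub("", image)
    # A tag is a ':' after the last '/', so a registry port (host:5000/img) is not a tag.
    has_tag = ":" in name.rsplit("/", 1)[-1]
    if stateful and not (has_tag and has_digest):
        return f"{image}: stateful images must be pinned as tag@digest"
    # Stateless tier: rolling tags are acceptable, so nothing to report.
    return None
```

A check like this would only flag the stateful tier; whether it belongs in the service checklist or in a scan script is left open here.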

@ -0,0 +1,38 @@
# Per-service operational-access record — template
Copy this file to `roles/<service>/ACCESS.md` when building a service role (ADR-021).
It is the per-service **operational-access record**: every documented, verifiable way in
for troubleshooting. The structured parts are **rendered from the role's `access__*`
data** (the single source of truth that also drives `/check-access`) — keep the data
authoritative and regenerate this file rather than hand-editing the tables. The prose
"Operational notes" tail is hand-written.
Delete this preamble in the copy and start from the heading below.
---
# Access — <service>
## Access paths
The documented ways in, by tier (rendered from `access__*`):
| Tier | Path | Invocation |
|---|---|---|
| primary | `wt0` mesh SSH | `ssh <host>` (over the NetBird mesh) |
| secondary | LAN SSH from `ubongo` | `ssh <host>` (from the control node, LAN address) |
| — | container exec + compose | `docker compose -p <access__compose_project> -f <access__compose_path> ps` / `exec` |
| — | logs | Loki query for labels `<access__log.loki_labels>` (Grafana; ADR-018) |
| — | admin API | `curl -H 'Authorization: …(vault_ref)' <access__api.base_url><health_path>` — or `n/a` |
## Break-glass
Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
- <Proxmox serial/VNC console for cluster VMs · Hetzner rescue for `askari` · local console for `ubongo`>
## Operational notes
Prose the data can't capture — service quirks, "if X is wedged, do Y", ordering gotchas.
- <none yet>
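
The "rendered from `access__*`" idea in this template can be sketched roughly as below. This is an illustration under assumptions: the data is taken to be a flat dict keyed by the `access__*` names used in the table, and `health_path` is assumed to live under `access__api`; `access_invocations` is a hypothetical helper, not the repo's renderer.

```python
def access_invocations(host: str, access: dict) -> dict:
    """Map each documented access path to the invocation shown in ACCESS.md."""
    project = access["access__compose_project"]
    compose_path = access["access__compose_path"]
    invocations = {
        "mesh SSH": f"ssh {host}",
        "LAN SSH": f"ssh {host}",
        "compose": f"docker compose -p {project} -f {compose_path} ps",
    }
    # A service without an admin API records "n/a", as the template allows.
    api = access.get("access__api")
    if api:
        invocations["admin API"] = f"{api['base_url']}{api['health_path']}"
    else:
        invocations["admin API"] = "n/a"
    return invocations
```

Keeping the rendering a pure function of the data is what lets the same values drive both the document and an automated check.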

@ -0,0 +1,44 @@
# Per-service backup record — template
Copy this file to `roles/<service>/BACKUP.md` when building a **stateful** service
role (ADR-022). It is the per-service **backup record**: what state the service holds,
how it is captured consistently, and how it is restored. The structured parts are
**rendered from the role's `backup__*` data** (the single source of truth that also
drives `/check-backup`) — keep the data authoritative and regenerate this file rather
than hand-editing the tables. The prose "Restore notes" tail is hand-written.
A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets
`backup__state: false` with a reason in its role defaults instead.
Delete this preamble in the copy and start from the heading below.
---
# Backup — <service>
## State captured
Rendered from `backup__*`:
| What | Source | How captured |
|---|---|---|
| data dir(s) | `<backup__paths[*]>` | file-level, pulled read-only |
| database | `<backup__dumps[*].cmd>` → `<backup__dumps[*].dest>` | logical dump (default; ADR-022 Decision 7) |
- **Quiesce:** `<backup__quiesce>` - `true` means the service is stopped → backed up →
restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).
## Restore procedure
1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
2. `restic restore` the latest snapshot for `<backup__service>` into `<backup__paths>`.
3. Replay each `<backup__dumps[*].dest>` into its database.
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).
## Restore notes
Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
known-tricky migrations.
- <none yet>
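
The capture ordering described under "State captured" can be sketched as a small planner. This is a hedged illustration: the field names come from the template, but the dict shape, the step wording, and the choice to run logical dumps before any quiesce are assumptions, and `plan_backup` is a hypothetical name.

```python
def plan_backup(spec: dict) -> list:
    """Return the ordered capture steps implied by a role's backup__* data."""
    # A stateless service (backup__state: false) has nothing to capture.
    if not spec.get("backup__state", True):
        return []
    steps = []
    # Assumption: logical dumps run against the live service, before any stop.
    for dump in spec.get("backup__dumps", []):
        steps.append(f"dump: {dump['cmd']} -> {dump['dest']}")
    quiesce = spec.get("backup__quiesce", False)
    if quiesce:
        steps.append("stop service")
    for path in spec.get("backup__paths", []):
        steps.append(f"capture: {path}")
    if quiesce:
        steps.append("start service")
    return steps
```

The point of the sketch is the bracket: with `backup__quiesce` set, the file-level capture always sits between the stop and the restart.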

@ -1,5 +1,9 @@
# ADR-001 — Architecture overview
## Status
Accepted (2026-05-30)
## Context
This document describes the overall architecture of the homelab infrastructure
@ -10,15 +14,16 @@ and the boundaries of what this Ansible monorepo manages.
- **Hypervisor**: Proxmox cluster (2+ nodes)
- **Guest OS**: Debian 13 (all managed hosts)
- **Scale**: 25 VMs, small fleet — treated as individuals, not cattle
- **Control node**: A dedicated Debian 13 VM on the cluster. Ansible runs from here.
The control node is the one host that cannot fully bootstrap itself from scratch
and requires manual initial setup (see `docs/runbooks/new-host.md`).
- **Control node**: `ubongo` — a dedicated always-on **physical** x86-64 machine
**outside** the cluster. Ansible runs from here. It cannot be created by the
Terraform it hosts, so it is provisioned manually (see ADR-015 and
`docs/runbooks/new-host.md`).
## What this repo manages
| Layer | Managed by | Notes |
|--------------------|--------------------|--------------------------------------------|
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; control node is the one manual exception (see ADR-009) |
| VM existence | Terraform (`terraform/`) | Clones the cloud-init template; `ubongo` (control node) is a physical box outside the cluster, the one manual exception (see ADR-009/ADR-015) |
| Internal DNS records | Ansible `dns` role | Internal zone rendered from inventory (see ADR-007/009) |
| OS baseline | Ansible `base` role | Users, SSH, firewall, updates, audit |
| Docker runtime | Ansible `docker_host` role | Engine, daemon config, log driver |
@ -32,14 +37,17 @@ describes the *intended* design — see STATUS.md for what is actually built.
```
all
├── control # the control node itself — baseline config only, runs no services
├── control # ubongo — physical control node outside the cluster; baseline config only, runs no services
├── docker_hosts # VMs running Docker services (most hosts)
└── proxmox_hosts # Proxmox nodes themselves (limited management scope)
├── proxmox_hosts # Proxmox nodes themselves (limited management scope)
└── offsite_hosts # askari (off-site Hetzner) — NetBird coordinator + external watchdog
```
The `control` group holds the single manually-provisioned control node; it is
managed for baseline config (SSH, firewall, updates) but never runs the
`docker_host` role. Proxmox nodes are managed only for basic baseline tasks (SSH).
`docker_host` role. The `offsite_hosts` group holds `askari`, the off-site Hetzner
host — also manually provisioned (ADR-016), managed for baseline config plus the
`netbird_coordinator` service role. Proxmox nodes are managed only for basic baseline tasks (SSH).
Proxmox configuration itself (storage, clustering, networking)
is out of scope.
@ -61,3 +69,21 @@ This architecture prioritises:
- **Simplicity**: few moving parts, no orchestration layer (no Kubernetes, no Swarm)
- **Reproducibility**: any host can be rebuilt from scratch via Ansible
- **Legibility**: a human reading the repo can understand what runs where
## Consequences
Drawn from the boundaries this ADR already states:
- The small fleet (25 VMs) is treated as individuals, not cattle (per Infrastructure),
and forgoing an orchestration layer is the cost of the simplicity priority (per
Decision).
- The control node `ubongo` cannot be created by the Terraform it hosts, so it is
provisioned manually — the one documented exception to Terraform-owned VM existence
(per Infrastructure / Host groups; ADR-009, ADR-015).
- Management scope is deliberately bounded: Proxmox configuration itself (storage,
clustering, networking) is out of scope, and the `control` group never runs the
`docker_host` role (per Host groups).
- Compose files are always regenerated by Ansible on deploy; no hand-edited Compose
files exist on hosts (per Service interaction model).
- The "What this repo manages" table describes the *intended* design — STATUS.md
records what is actually built (per that section).

@ -1,5 +1,9 @@
# ADR-002 — Security baseline and strategy
## Status
Accepted (2026-05-30)
## Context
Security here is not a single control but the sum of several combined efforts —
@ -75,7 +79,8 @@ time. Each heading tags the threat(s) it primarily serves.
### Updates — *opportunistic*
- `unattended-upgrades` enabled for **security patches only**
- Full system upgrades triggered deliberately via Ansible (`make deploy PLAYBOOK=upgrade`)
- Full system upgrades triggered deliberately via Ansible (planned — a dedicated upgrade
playbook per ADR-011; not yet built, no `upgrade.yml` exists today)
- No automatic reboots — reboots are a conscious operational decision
### Minimal attack surface — *opportunistic, blast radius*
@ -87,7 +92,9 @@ time. Each heading tags the threat(s) it primarily serves.
### Audit trail — *agent error, blast radius*
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location if a log aggregation service is available
- Logs shipped to a central location in near-real-time — all logs to an on-cluster
Loki, plus a security-relevant subset write-only off-site to `askari` so the audit
trail survives host (and full-cluster) compromise (ADR-018)
### Mandatory access control — *blast radius*
@ -102,8 +109,9 @@ time. Each heading tags the threat(s) it primarily serves.
- **AIDE** file-integrity monitoring (required by the CIS Debian benchmark) — detects
unexpected changes to system files
- **Network IDS** — Suricata on OPNsense (planned; see STATUS.md / TODO)
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata — plus
log-source-silence (a host that stops shipping) — into Grafana alerting on the
Loki/Grafana stack (ADR-018; planned)
## Secrets management — *agent error, opportunistic*
@ -180,3 +188,27 @@ This posture was chosen to be:
Out-of-scope items and conscious trade-offs are recorded in
`docs/security/accepted-risks.md` rather than here, so this decision record stays
stable while the risk posture evolves.
## Consequences
Drawn from the trade-offs, scoping, and follow-on work this ADR already states:
- Targeted/physical adversaries are out of scope at this scale, and supply chain is
consciously deprioritized — active vuln scanning is deferred as an accepted risk
(per Threat model; `docs/security/accepted-risks.md`).
- SELinux is not used (non-native to Debian, redundant with AppArmor), recorded as an
accepted risk (per Mandatory access control).
- Some CIS L2 items require separate partitions with restrictive mount options, which
reaches into VM disk layout — a provisioning concern (Terraform / cloud-init, ADR-006),
not just the `base` role (per Hardening standard). Any impractical CIS item is exempted
into the accepted-risk register with rationale, recording named exceptions rather than a
blanket opt-out.
- Several controls and governance mechanisms are stated as planned, not yet built:
Suricata network IDS, active alerting wiring AIDE/`auditd`/`fail2ban`/Suricata plus
log-source-silence into Grafana, the `/security-review` skill and its aggregation of
every `roles/*/SECURITY.md`, and the periodic security review (per File integrity /
Governance; STATUS.md / `docs/TODO.md`).
- The per-service security bar is enforced manually in review today, pending the planned
`/security-review` automation (per Governance).
- The accepted-risk register is kept out of this ADR so the record stays stable while the
risk posture evolves (per Decision; `docs/security/accepted-risks.md`).

@ -1,6 +1,20 @@
# ADR-003 — Toolchain decisions
## Execution engine
## Status
Accepted (2026-05-30)
## Context
boma needs a defined, reproducible toolchain for running and testing its Ansible
monorepo: an execution engine, a Python environment, secrets handling, a testing
framework, linting, CI/CD, developer-ergonomics conventions, and a collections/roles
policy. This ADR records the choice made for each, together with the alternatives
weighed and why they were not adopted.
## Decision
### Execution engine
**Choice**: `ansible-core` (pip-installed, pinned version) + explicit `requirements.yml`
@ -12,7 +26,7 @@ that isn't needed in a maintained monorepo.
---
## Python environment
### Python environment
**Choice**: `python3-venv` (system Python on Debian 13) + pinned `requirements.txt`
@ -24,7 +38,7 @@ reproducible, and has no extra dependencies.
---
## Secrets
### Secrets
**Choice**: Ansible Vault (file-based, built-in)
@ -40,7 +54,7 @@ CLAUDE.md → Secrets).
---
## Testing
### Testing
**Choice**: Molecule with Docker driver (`molecule-plugins[docker]`)
@ -59,7 +73,7 @@ are needed.
---
## Linting
### Linting
**Choice**: `ansible-lint` + `yamllint` + `pre-commit`
@ -71,7 +85,7 @@ Config files: `.ansible-lint`, `.yamllint` in repo root.
---
## CI/CD
### CI/CD
**Choice**: Forgejo Actions (self-hosted at forgejo.nyumbani.baobab.band) + `act_runner`
@ -82,11 +96,12 @@ Config files: `.ansible-lint`, `.yamllint` in repo root.
2. On green → deploy to staging
3. [manual promote gate] → deploy to production
`act_runner` runs as a Docker container on the control node or a dedicated runner VM.
`act_runner` runs as a Docker container on `ubongo` (the control node — ADR-015), or on
a dedicated runner VM later if CI load warrants a separate host.
---
## Developer ergonomics
### Developer ergonomics
**Choice**: `Makefile` as the single interface for all operations
@ -101,7 +116,7 @@ The venv is activated in the user's shell profile.
---
## Collections and roles policy
### Collections and roles policy
**No Galaxy roles.** All roles are written and maintained locally in `roles/`.
Galaxy roles introduce external state, versioning surprises, and implicit
@ -135,3 +150,24 @@ are removed. Each entry in `requirements.yml` must justify its presence.
| NixOS targets | Poor Ansible fit; all hosts standardised on Debian 13 |
Terraform is **adopted** for VM provisioning only (no DNS) — see `docs/decisions/006-terraform.md`.
## Consequences
Drawn from the rationale and trade-offs this ADR already states:
- Pinning `ansible-core` + an explicit `requirements.yml` and a plain pinned venv keeps
the control-node environment small and fully reproducible, at the cost of maintaining
the pins (per Execution engine / Python environment).
- Ansible Vault's whole-file encryption makes diffs unreadable regardless of layout, so
secrets are organised for human lookup (`vault.<service>.<key>`) rather than diff
ergonomics — the trade accepted against SOPS/age (per Secrets).
- The `Makefile` is the single interface: Claude Code and CI invoke the same targets, so
local and CI behaviour can't drift and collaborators need not know raw flags (per
Developer ergonomics).
- Collections are added only on demand, so `requirements.yml` stays minimal; this defers
`community.crypto` (use `openssl` CLI until a role needs certs) and `community.general`
(add only the specific sub-module needed) until a real need appears (per Collections
and roles policy).
- The heavier orchestration tools were declined for this scale, each with a named
revisit trigger — e.g. Semaphore if non-SSH operators must trigger runs, AWX-adjacent
tooling only if AWX/AAP is ever adopted (per "What was explicitly ruled out").

@ -1,5 +1,9 @@
# ADR-004 — Docker and Compose service model
## Status
Accepted (2026-05-30)
## Context
All services run as Docker containers managed via Docker Compose. This document
@ -42,8 +46,18 @@ below). Each service role contains a standard set of files:
| `defaults/main.yml` | Tuneables, `rolename__` namespace |
| `README.md` | Purpose, variables, usage (role convention) |
| `SECURITY.md` | Per-service security record — see ADR-002 and `docs/security/service-security-template.md` |
| `VERIFY.md` | Per-service UI acceptance spec — see ADR-008 Level 4 / ADR-017 and `docs/testing/service-verify-template.md` |
| `ACCESS.md` | Per-service operational-access record — see ADR-021 and `docs/access/service-access-template.md` |
| `BACKUP.md` | Per-service backup record — see ADR-022 and `docs/backup/service-backup-template.md` (a stateless service declares `backup__state: false` with a reason) |
| `meta/main.yml`, `molecule/default/` | Metadata + Debian 13 test scenario |
The `access__*` (ADR-021) and `backup__*` (ADR-022) data in `defaults/main.yml` are
**cross-role conventions** — shared field names that deliberately do *not* carry the
`<rolename>__` prefix. ansible-lint's `var-naming[no-role-prefix]` has no per-prefix
allowlist, so each such line carries a trailing `# noqa: var-naming[no-role-prefix]` (the
rule stays enforced for genuinely role-scoped vars). `make new-role` scaffolds a reminder;
`roles/reverse_proxy/defaults/main.yml` is the reference.
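
As a rough illustration of the convention above, a stdlib-only check could flag cross-role keys that are missing the trailing comment. It assumes the keys sit at the top level of `defaults/main.yml` and that the comment is on the key's own line; `missing_noqa` is a hypothetical helper, not an existing script.

```python
import re

# Top-level keys using the shared cross-role prefixes.
_CROSS_ROLE = re.compile(r"^(access__|backup__)\w*\s*:")
_NOQA = "# noqa: var-naming[no-role-prefix]"


def missing_noqa(defaults_text: str) -> list:
    """Return 1-based line numbers of access__*/backup__* keys lacking the noqa."""
    missing = []
    for lineno, line in enumerate(defaults_text.splitlines(), start=1):
        if _CROSS_ROLE.match(line) and _NOQA not in line:
            missing.append(lineno)
    return missing
```

Role-scoped variables are deliberately ignored, so the lint rule itself remains the only check on those.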
### Standard deploy mechanics
Every service role's `tasks/main.yml` follows the same sequence, so all roles are
@ -97,7 +111,9 @@ Managed by the `docker_host` role. Key settings:
- Bind mounts preferred over named volumes for data that must be backed up
- All bind mount paths are under `/opt/services/<name>/data/`
- Backup strategy is defined separately (not in scope of this repo)
- Backup strategy is defined in **ADR-022** — the bind mounts under
`/opt/services/<name>/data/` are exactly the unit ADR-022's per-service `backup__*`
contract (and `BACKUP.md`) captures
## Decision
@ -106,3 +122,23 @@ Docker Compose was chosen over Kubernetes/Swarm because:
- Compose files are human-readable and easily auditable
- No distributed state to manage
- Straightforward to back up and restore
## Consequences
Drawn from the trade-offs and deferred items this ADR already states:
- A shared `compose_service` engine role is intentionally not built: the ~5 standard
tasks are duplicated per role in favour of legible, self-contained roles, with a stated
revisit trigger — extract a shared engine if maintaining the duplicated mechanics
becomes painful (a pattern change touching many roles, or drift this standard alone
isn't preventing) (per "Why not a shared engine").
- Forgoing Kubernetes/Swarm is the deliberate cost of matching complexity to a 25 host
fleet with no distributed state to manage (per Decision).
- User-namespace remapping is not enabled by default — evaluated per use case (per Docker
daemon configuration).
- Bare `latest` is acceptable only on the stateless tier; the stateful tier is always
pinned `tag@digest`, and image updates are a deliberate operation (per Image management;
ADR-011).
- Backup strategy is defined in ADR-022 (not in this ADR); the persistent bind mounts
under `/opt/services/<name>/data/` are the unit ADR-022's per-service `backup__*`
contract captures (per Persistent data).

@ -1,13 +1,17 @@
# ADR-005 — Host bootstrapping
## Status
Accepted (2026-05-30)
## Context
This document defines the **cloud-init template** that managed VMs are cloned
from, and the **control-node** bootstrapping special case. The per-host
provisioning pipeline — how a VM is created from this template and handed off to
Ansible — is owned by ADR-009. Terraform clones the template defined here; the
template is the base image both for Terraform-managed hosts and for the manually
provisioned control node.
template is the base image for Terraform-managed hosts. The control node (`ubongo`)
is a physical machine installed directly, not cloned from this template (ADR-015).
## Approach: Proxmox cloud-init template
@ -32,10 +36,10 @@ High-level steps:
## VM provisioning (per new host)
Per-host VMs are created by **Terraform**, which clones this template, sets the
cloud-init values (hostname, SSH public key, IP/gateway), and writes the host's
DNS A record. Cloud-init runs at first boot (~30–60 seconds), leaving the VM
reachable via SSH with the ansible user's key.
Per-host VMs are created by **Terraform**, which clones this template and sets the
cloud-init values (hostname, SSH public key, IP/gateway). Cloud-init runs at first
boot (~30–60 seconds), leaving the VM reachable via SSH with the ansible user's key.
Terraform writes no DNS records — the `dns` role owns the internal zone (ADR-009).
The full create → inventory → configure pipeline, and the Terraform↔Ansible data
contract, are defined in **ADR-009 (provisioning handoff)**. There is no manual
@ -51,11 +55,12 @@ for the end-to-end commands and `docs/runbooks/new-host.md` for the full procedu
## Control node bootstrapping
The control node is a special case — it runs Terraform and Ansible, so it cannot
be created by the Terraform it hosts (chicken-and-egg). It is the one documented
exception to Terraform-owned VM existence (see ADR-009). The control node requires:
be created by the Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated
**physical** machine outside the cluster, and the one documented exception to
Terraform-owned VM existence (see ADR-009 and ADR-015). The control node requires:
1. Manual VM provisioning — clone this cloud-init template by hand (Proxmox UI or
`qm clone`), since Terraform is not yet available to do it
1. Manual OS provisioning — install Debian 13 on the physical box by hand (it is not
a Proxmox guest, so there is no template to clone)
2. Manual setup of the Ansible environment:
```bash
git clone <repo> ~/ansible
@ -68,9 +73,10 @@ exception to Terraform-owned VM existence (see ADR-009). The control node requir
```
3. After that, the control node can manage all other hosts normally
The control node itself is listed in `inventories/production/hosts.yml` under
a `control` group and can be managed for baseline config (SSH, firewall, updates)
but not for the `docker_host` role (it does not run services).
`ubongo` is listed in `inventories/production/hosts.yml` under the `control` group
and can be managed for baseline config (SSH, firewall, updates) but not for the
`docker_host` role (it does not run services). Hardware target and recovery model
are in ADR-015.
## Decision
@ -79,3 +85,19 @@ Cloud-init with Proxmox templates provides:
- No manual installer interaction
- A clean handoff point to Ansible
- Easy rebuilds — destroy VM, clone template, run Ansible
## Consequences
Drawn from the trade-offs and special cases this ADR already states:
- The cloud-init image was chosen over a manual Debian installer (slow, error-prone,
not reproducible) and over preseed/netboot (powerful but complex to maintain) (per
Approach).
- Template creation is a one-time manual procedure per Proxmox cluster, and the template
is never booted directly (per Template creation).
- There is no manual `qm clone` path for managed hosts; the full create → inventory →
configure pipeline and the Terraform↔Ansible contract live in ADR-009 (per VM
provisioning / Ansible handoff).
- The control node is the sole documented exception — `ubongo`, a physical machine
installed by hand because it cannot be created by the Terraform it hosts (chicken-and-egg);
its hardware target and recovery model live in ADR-015 (per Control node bootstrapping).

@ -1,10 +1,14 @@
# ADR-006 — Terraform for infrastructure provisioning
## Status
Accepted (2026-05-30)
## Context
Ansible manages host configuration well but has no state model for infrastructure
existence. Adding Terraform handles the "what exists" layer — creating and destroying
VMs on Proxmox — while Ansible continues to own everything that runs inside them,
VMs on Proxmox and Hetzner — while Ansible continues to own everything that runs inside them,
including all internal DNS records.
This complements rather than replaces Ansible. The two tools do not overlap. The
@ -13,7 +17,9 @@ exact boundary, handoff pipeline, and data contract between them live in **ADR-0
---
## Responsibility split
## Decision
### Responsibility split
The canonical responsibility-split table lives in **ADR-009**. In short: Terraform
owns VM existence only; Ansible owns everything inside a VM, including all internal
@ -26,11 +32,16 @@ cadence, making them a poor fit for Terraform state.
---
## Providers
### Providers
**`bpg/proxmox` (`~> 0.70`)**: Chosen over `telmate/proxmox` for active maintenance,
full Proxmox 8 API support, and better cloud-init integration. This is the only
provider.
full Proxmox 8 API support, and better cloud-init integration. This is the provider
for Proxmox VMs.
**`hetznercloud/hcloud` (`~> 1.65`)**: owns off-site VM existence (`askari`). ADR-006's
scope is now **Proxmox + Hetzner** — "Terraform owns VM existence" generalizes across
providers. The `offsite` environment + `hetzner_vm` module live alongside the Proxmox env
+ `proxmox_vm` module; each environment has its own local state.
Terraform does **not** manage DNS. An earlier design used `hashicorp/dns` (RFC 2136)
to write A records, but that created a bootstrap cycle — the first DNS server cannot
@ -42,7 +53,7 @@ Terraform manages its own provider dependencies via `required_providers` and
---
## State backend
### State backend
**Choice**: Local state on the control node.
@ -59,15 +70,17 @@ integration boundary.
---
## Structure
### Structure
```
terraform/
modules/
proxmox_vm/ # reusable VM module — Proxmox only, no DNS
hetzner_vm/ # reusable VM module — Hetzner Cloud, no DNS
environments/
staging/ # staging VMs, separate state file
production/ # production VMs, separate state file
staging/ # staging Proxmox VMs, separate state file
production/ # production Proxmox VMs, separate state file
offsite/ # off-site Hetzner VMs (askari), separate state file
```
Separate environment directories (not Terraform workspaces) for the clearest
@ -75,7 +88,7 @@ isolation — no risk of accidentally applying the wrong state.
Each environment directory contains:
- `providers.tf` — provider version pins and configuration
- `backend.tf` - Forgejo state backend (environment-specific path)
- `backend.tf` - backend configuration (local state on the control node; no remote backend; see "State backend" above)
- `variables.tf` — input declarations
- `terraform.tfvars.example` — tracked template; copy to `terraform.tfvars` for actual values
- `main.tf` - `local.vms` map and module calls (no DNS resources)
@ -83,7 +96,7 @@ Each environment directory contains:
---
## Secrets handling
### Secrets handling
The only secret input (the Proxmox API token) is passed via a `TF_VAR_*`
environment variable and declared `sensitive = true` in `variables.tf`. It never
@ -92,7 +105,7 @@ appears in `.tfvars` files. Non-secret configuration lives in tracked
---
## Ansible integration
### Ansible integration
After `terraform apply`, run `make tf-inventory TF_ENV=<env>` to regenerate
`inventories/<env>/hosts.yml` from the `vms` output. The full handoff pipeline,
@ -102,7 +115,7 @@ handoff)**.
---
## What was ruled out
### What was ruled out
| Option | Reason |
|---|---|
@ -110,3 +123,26 @@ handoff)**.
| OPNsense Terraform provider | Community-maintained; provider rot risk across OPNsense releases |
| Terraform workspaces | Single state file with workspace prefix; accidental cross-env apply possible |
| Separate Terraform repo | Cross-referencing between infra and config adds friction; monorepo keeps the full picture together |
## Consequences
Drawn from the "What was ruled out" section and the decisions stated above:
- `bpg/proxmox` is the provider for Proxmox VMs; `telmate/proxmox` was ruled out for weaker
maintenance and Proxmox 8 / cloud-init support (Providers; What was ruled out).
- `hetznercloud/hcloud` is the provider for off-site VM existence (`askari`); ADR-006's
scope now covers Proxmox + Hetzner (Providers).
- OPNsense stays entirely in Ansible — no Terraform OPNsense provider — to avoid
community-provider rot across OPNsense releases (Responsibility split; What was
ruled out).
- Terraform writes no DNS records; Ansible's `dns` role owns the entire internal
zone, avoiding the bootstrap cycle and split DNS ownership the earlier
`hashicorp/dns` design created (Providers).
- State is local on the control node because Forgejo offers no usable HTTP state
backend; this is sufficient at solo-operator scale (no concurrent applies, no
remote locking), with a real backend such as MinIO/S3 to be added later if
warranted (State backend).
- Separate environment directories are used instead of Terraform workspaces to
remove the risk of applying the wrong state (Structure; What was ruled out).
- Terraform and Ansible internals are kept in one monorepo rather than a separate
Terraform repo to avoid cross-referencing friction (What was ruled out).

@ -1,5 +1,9 @@
# ADR-007 — Network topology and addressing
## Status
Accepted (2026-05-30)
## Context
The boma homelab is a Proxmox cluster on a dedicated private network behind an
@ -10,7 +14,9 @@ and OPNsense configuration.
---
## Physical topology
## Decision
### Physical topology
```
ISP
@ -38,7 +44,7 @@ ISP
---
## VLAN design
### VLAN design
| VLAN | Name | Subnet | Purpose |
|---|---|---|---|
@ -47,13 +53,13 @@ ISP
| 30 | `lan` | `10.30.0.0/24` | Trusted home devices. DHCP. Access to selected `srv` services via OPNsense. |
| 40 | `iot` | `10.40.0.0/24` | Smart home, cameras, printers. DHCP. Internet egress only + HA exception. |
| 50 | `guest` | `10.50.0.0/24` | Guest WiFi. DHCP. Internet only, fully isolated. |
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
| 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. |
---
## IP addressing
### IP addressing
### VLAN 10 — mgmt (10.10.0.0/24) — no DHCP
#### VLAN 10 — mgmt (10.10.0.0/24) — no DHCP
| Address | Host |
|---|---|
@ -63,7 +69,7 @@ ISP
| `10.10.0.201` | `pve1` |
| `10.10.0.202` | `pve2` |
### VLAN 20 — srv (10.20.0.0/24) — no DHCP, all static
#### VLAN 20 — srv (10.20.0.0/24) — no DHCP, all static
| Range | Purpose |
|---|---|
@ -81,36 +87,45 @@ Assigned infrastructure addresses:
| `10.20.0.12` | `proxy` | Reverse proxy |
| `10.20.0.13` | `homeassistant` | Home Assistant (IoT controller) |
### VLAN 30 — lan (10.30.0.0/24)
> **Control node `ubongo` — legacy V4 network (transitional).** `ubongo` (ADR-015) is the
> manually-provisioned physical control node and currently lives on the **legacy V4
> homelab network at `10.20.10.151`** — boma is being built up from the V4 base, and the
> physical LAN has not yet been re-cut to this VLAN scheme. That address is therefore
> **outside** the planned `srv` `10.20.0.0/24`; `base__firewall_control_addr` and the
> inventory point at the real (V4) address. When the network is migrated to these VLANs,
> `ubongo` moves into `mgmt`/`srv` and this note is retired.
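
The claim that the legacy address falls outside the planned `srv` subnet can be checked with the standard `ipaddress` module. A minimal sketch; `in_planned_srv` is a hypothetical helper name, and the addresses are the ones stated in this ADR.

```python
import ipaddress

# Planned srv VLAN from the VLAN design table.
SRV = ipaddress.ip_network("10.20.0.0/24")


def in_planned_srv(addr: str) -> bool:
    """True if addr belongs to the planned srv subnet."""
    return ipaddress.ip_address(addr) in SRV
```

`in_planned_srv("10.20.10.151")` is false because a `/24` fixes the third octet at `0`, while `in_planned_srv("10.20.0.12")` (the `proxy` address) is true.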
#### VLAN 30 — lan (10.30.0.0/24)
| Range | Purpose |
|---|---|
| `10.30.0.1` | OPNsense gateway |
| `10.30.0.100`–`.249` | DHCP pool |
### VLAN 40 — iot (10.40.0.0/24)
#### VLAN 40 — iot (10.40.0.0/24)
| Range | Purpose |
|---|---|
| `10.40.0.1` | OPNsense gateway |
| `10.40.0.100`–`.249` | DHCP pool |
### VLAN 50 — guest (10.50.0.0/24)
#### VLAN 50 — guest (10.50.0.0/24)
| Range | Purpose |
|---|---|
| `10.50.0.1` | OPNsense gateway |
| `10.50.0.100`–`.249` | DHCP pool |
### VLAN 99 — vpn (10.99.0.0/24) — WireGuard
#### VLAN 99 — vpn — retired
| Address | Host |
|---|---|
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
| `10.99.0.2` | `askari` (Hetzner VPS) |
| `10.99.0.10`+ | Road-warrior clients |
The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh**
(ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a
self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane
NetBird self-hosted on `askari`. NetBird manages its own overlay addressing
(default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and
`10.99.0.0/24` is freed.
### Corosync ring (172.16.0.0/24) — not on managed switch
#### Corosync ring (172.16.0.0/24) — not on managed switch
| Address | Host |
|---|---|
@ -120,7 +135,7 @@ Assigned infrastructure addresses:
---
## OPNsense firewall rules (intent)
### OPNsense firewall rules (intent)
| Source | Destination | Policy |
|---|---|---|
@ -132,8 +147,8 @@ Assigned infrastructure addresses:
| `iot` | internet | allow egress only |
| `iot` | `srv` (HA IP only) | allow on integration ports |
| `guest` | internet | allow, isolated from all internal |
| `vpn` | `srv` (metrics ports) | allow (monitoring) |
| `vpn` | `mgmt` | allow (administration from askari) |
| mesh peers | `srv` (metrics ports) | allow (monitoring) — enforced by NetBird ACLs, not OPNsense (ADR-016) |
| mesh peers | `mgmt` | allow (administration) — enforced by NetBird ACLs (ADR-016) |
**Home Assistant ↔ IoT**: HA VM at `10.20.0.13` can reach IoT VLAN on required
ports. OPNsense Avahi (mDNS reflector) bridges `srv` ↔ `iot` for device discovery.
@ -141,7 +156,7 @@ IoT devices cannot initiate connections to `srv`.
---
## Naming scheme
### Naming scheme
| Layer | Convention | Examples |
|---|---|---|
@ -150,37 +165,74 @@ IoT devices cannot initiate connections to `srv`.
| Infrastructure VMs | `<role><n>` | `dns1`, `dns2`, `proxy` |
| Hetzner VPS | `askari` | Swahili for guard/sentinel |
| Internal FQDN | `<host>.boma.baobab.band` | `dns1.boma.baobab.band` |
| Public service FQDN | `<service>.baobab.band` | `forgejo.nyumbani.baobab.band` |
| Public service FQDN | `<service>.wingu.me` | `vaultwarden.wingu.me` |
| Off-site (VPS) FQDN | `<service>.askari.wingu.me` | `netbird.askari.wingu.me` |
---
## DNS zones and split-horizon
### DNS zones and split-horizon
**Internal zone**: `boma.baobab.band` — served by `dns1` and `dns2`.
**Internal zone**: `boma.baobab.band` **today** (the `dns` role is unbuilt) — served by
`dns1` and `dns2`. **Target:** it is renamed to `boma.wingu.me` in Phase 2 when the `dns`
role lands. Until then `boma.baobab.band` is the authoritative internal name **everywhere
it appears** (the naming table above, split-horizon below, the OPNsense forwarder, and
ADR-009/016). This is the single source for that transition; other references use the
current name and inherit this caveat.
The zone is rendered by the Ansible `dns` role: host A records come from the
inventory (which derives from Terraform's `local.vms` via `make tf-inventory`),
and service/alias/split-horizon records are explicit zone data in `group_vars`.
Terraform itself writes no DNS records — see ADR-009.
**Public zone**: `baobab.band` — served by external DNS (Cloudflare or equivalent).
Public-facing services resolve to the public IP or Cloudflare proxy.
**Public zone**: `wingu.me` — Gandi LiveDNS, **managed as code** by the `public_dns`
role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal — the
Phase-2 target; currently `boma.baobab.band`, see *Internal zone* above), services
`<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
`nyumbani` is retired. **Mesh/LAN-only by default**: home services have no public record
(reached over LAN or the NetBird mesh); only deliberate exceptions are published. The
project is `boma`; the domain is `wingu.me`. The legacy `baobab.band` zone (Cloudflare)
is out of scope here.
**Split-horizon**: `dns1`/`dns2` serve internal answers for any hostname that has
both a public and private face. Example: `forgejo.nyumbani.baobab.band` resolves to
`10.20.0.12` (proxy) internally and to the public IP externally.
both a public and private face. Example: `vaultwarden.wingu.me` resolves to
`10.20.0.12` (proxy) internally and to the public IP externally (the internal
zone will be renamed to `boma.wingu.me` when the `dns` role is built — Phase 2).
OPNsense DNS resolver forwards `boma.baobab.band` queries to `dns1`/`dns2`.
All other queries go upstream (e.g., `1.1.1.1`, `9.9.9.9`).
---
## External monitoring — askari
### External monitoring — askari
`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
for administration.
`askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts
the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv`
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
`askari` is provisioned and managed independently of the Proxmox cluster — it
must be reachable even when the homelab is down (its entire purpose).
FQDN: `askari.baobab.band`.
`askari` is provisioned as **Terraform IaC** (`hetznercloud/hcloud`), managed
independently of the Proxmox cluster (its own provider + local state in
`terraform/environments/offsite/`). It must be reachable even when the homelab is down
(its entire purpose), which is also why the mesh coordinator lives here: an off-site
control plane survives a homelab outage.
FQDN: `askari.wingu.me` (off-site tier; record added by `public_dns` when askari exists — M2/M4).
---
## Consequences
Drawn from the implications already stated above:
- VLAN 99 (`vpn`, `10.99.0.0/24`) is retired and the subnet freed; remote access is
carried by the self-hosted NetBird mesh instead of an OPNsense WireGuard subnet
(VLAN design; IP addressing — VLAN 99 retired).
- Mesh-peer firewall allowances (to `srv` metrics ports and `mgmt`) are enforced by
NetBird ACLs, not OPNsense rules (OPNsense firewall rules (intent)).
- IoT devices cannot initiate connections to `srv`; only Home Assistant at
`10.20.0.13` may reach the IoT VLAN, with OPNsense Avahi bridging `srv` ↔ `iot`
for discovery (OPNsense firewall rules (intent)).
- Terraform writes no DNS records; the Ansible `dns` role renders the internal zone
from inventory plus `group_vars`, with `dns1`/`dns2` serving split-horizon answers
(DNS zones and split-horizon).
- `askari` runs independently of the cluster so it survives a homelab outage, which
is why the off-site NetBird control plane lives there (External monitoring —
askari).

@ -1,5 +1,12 @@
# ADR-008 — Testing methodology
> Practical point-of-use pitfalls (nft render checks, Molecule `community.docker`,
> apply-path coverage blind spots) live in `docs/testing/gotchas.md`.
## Status
Accepted (2026-05-30)
## Context
Ansible roles must be idempotent and correct before they touch production hosts.
@ -8,11 +15,13 @@ This document records the testing strategy, what each level covers, and — crit
---
## Three testing levels
## Decision
### Level 1 — Molecule (per role, always required)
### Three testing levels
Runs in Docker on the control node or in CI. Fast (~5 min per role).
#### Level 1 — Molecule (per role, always required)
Runs in Docker on the control node (`ubongo`) or in CI. Fast (~5 min per role).
**What happens during `molecule test`:**
1. `create` — start the test container
@ -38,7 +47,7 @@ The idempotency step is non-negotiable. Every role must pass it cleanly.
that: svc.stdout == "active"
```
### Level 2 — Staging playbook (full stack, real VMs)
#### Level 2 — Staging playbook (full stack, real VMs)
`make check PLAYBOOK=site` followed by `make deploy PLAYBOOK=site` on
Terraform-provisioned staging VMs. Catches inter-role dependencies and ordering
@ -47,15 +56,35 @@ have already run and configured the firewall).
Run before every merge to `main`.
### Level 3 — External smoke test from askari
#### Level 3 — External smoke test from askari
Once `askari` is operational: scripted checks from outside the network confirming
that public-facing services respond correctly. Catches firewall and reverse proxy
configuration issues invisible to Ansible check mode.
#### Level 4 — Service-UI acceptance (Claude-driven exploratory)
A Claude-driven exploratory check of a service's **application UI**, run as
`/verify-service <name>` on `ubongo` (ADR-017). Claude drives Chromium via the
`playwright` plugin against a **staging** deploy, authenticates through the real
Caddy (ADR-024) + Authentik SSO flow using a test user in the staging `test` group, then
executes the service's `roles/<service>/VERIFY.md` acceptance journeys *and*
free-explores — judging pass/fail, screenshotting key states. It writes a dated report
to `docs/testing/reviews/` and hands the operator a manual-test checklist for anything
it can't verify (hardware, paid/external flows, subjective judgment).
Catches application-level regressions no lower level sees ("does PhotoPrism actually
serve photos?"). Placement: after Level 2 (staging deploy), before production
promotion. Exploratory and interactive by design — *not* a deterministic CI/cron gate
(that role belongs to health checks / Uptime Kuma).
**Status:** the skill, the `VERIFY.md` template, and standards are authorable now;
running it is deferred until `ubongo`, the `playwright` plugin, Authentik, and a staging
deploy are available (STATUS.md). Full design: ADR-017.
---
## Molecule test image
### Molecule test image
**No external images.** The project builds and hosts its own test image.
@ -80,7 +109,7 @@ functionally equivalent and fully owned.
---
## Idempotency requirements
### Idempotency requirements
Every role task must satisfy one of these:
@ -98,9 +127,9 @@ catches anything lint misses.
---
## What Molecule tests — and what it does not
### What Molecule tests — and what it does not
### Tested in Molecule
#### Tested in Molecule
| Capability | Notes |
|---|---|
@ -116,7 +145,7 @@ catches anything lint misses.
| auditd installation and configuration | Install and config file |
| Idempotency of all of the above | Enforced by Molecule's idempotency step |
### Not tested in Molecule — explicit exceptions
#### Not tested in Molecule — explicit exceptions
The following require a real kernel or real hardware and are validated only at
Level 2 (staging) or Level 3 (external). This is a conscious, documented decision
@ -125,7 +154,8 @@ Level 2 (staging) or Level 3 (external). This is a conscious, documented decisio
| Capability | Reason not testable in Molecule |
|---|---|
| `nftables` rule loading | Requires `nf_tables` kernel module; not available in Docker |
| WireGuard tunnel establishment | Requires `wireguard` kernel module |
| **Reboot-survivability / host-firewall × Docker interaction / boot-ordering** | **Requires a real kernel reboot — the class that caused the 2026-06-17 mesh-hardening incident. Now covered by local VM integration testing (ADR-025).** |
| NetBird mesh data plane (`wt0` WireGuard interface) | Requires the `wireguard` kernel module; Molecule checks only that the agent is installed/configured (ADR-016) |
| `unattended-upgrades` behaviour | Installs correctly; actual upgrade behaviour requires a real apt environment |
| DHCP behaviour (OPNsense) | OPNsense is managed by Ansible but not testable in a container |
| mDNS reflector (Avahi cross-VLAN) | Requires real network interfaces and VLANs |
@ -136,9 +166,14 @@ For the above, Molecule tests only what it can: that the relevant packages are
installed, that configuration files render correctly, and that services are enabled.
Behavioural correctness is confirmed on staging.
**ADR-025 is the concrete build of Level 2/3** — local VM integration testing on
ubongo (libvirt/KVM, throwaway overlay VMs, stdlib-only driver). It specifically
targets the reboot-survivability / host-firewall × Docker / boot-ordering class that
Molecule structurally cannot reach. See `docs/decisions/025-local-vm-integration-testing.md`.
---
## CI pipeline
### CI pipeline
```
push to main
@ -155,3 +190,27 @@ promote to production
Manual gates are intentional. Automated tests prove correctness in isolation;
a human confirms the change is safe to promote.
---
## Consequences
Drawn from the limitations and trade-offs already stated above:
- The Molecule idempotency step is non-negotiable; every role must pass it cleanly
(Three testing levels — Level 1).
- A class of capabilities (nftables rule loading, NetBird mesh data plane,
unattended-upgrades behaviour, OPNsense DHCP, Avahi mDNS reflection, hardware
passthrough, corosync cluster formation) cannot be verified in Molecule and is
validated only at Level 2 (staging) or Level 3 (external) — a conscious,
documented decision, not a gap (What Molecule tests — and what it does not).
- The project builds and hosts its own `molecule-debian13` image rather than relying
on an external Docker Hub image (e.g. geerlingguy), accepting the maintenance of a
custom image to avoid drift, disappearance, or unexpected changes outside project
control (Molecule test image).
- Level 4 service-UI acceptance is authorable now but its execution is deferred,
pending `ubongo`, the `playwright` plugin, Authentik, and a staging deploy (Three
testing levels — Level 4).
- Promotion to staging and to production stays behind intentional manual approval
gates; automation proves isolated correctness, a human confirms promotion safety
(CI pipeline).

@ -1,5 +1,9 @@
# ADR-009 — Terraform ↔ Ansible provisioning handoff
## Status
Accepted (2026-05-30)
## Context
Two tools touch every managed host. Terraform owns **what exists** — VMs on
@ -14,7 +18,9 @@ the cloud-init template that VMs are cloned from. This ADR covers how they conne
---
## The boundary
## Decision
### The boundary
| Layer | Tool | Notes |
|---|---|---|
@ -31,7 +37,7 @@ below).
---
## The handoff pipeline
### The handoff pipeline
There is one path by which a managed host comes into existence and reaches its
configured state:
@ -55,7 +61,7 @@ this pipeline — **never** by hand-editing the inventory.
---
## The data contract
### The data contract
The seam's interface is a single Terraform output consumed by a single script.
@ -75,7 +81,12 @@ The seam's interface is a single Terraform output consumed by a single script.
`terraform output -json` and writes `inventories/<env>/hosts.yml`. It validates the
group against the allowed set and fails loudly on an unknown group.
**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`.
**Valid groups**: `control`, `docker_hosts`, `proxmox_hosts`, `offsite_hosts`.
`control` holds `ubongo`, a physical machine not managed by Terraform (see the
control-node exception below and ADR-015). `offsite_hosts` holds `askari`, which is
Terraform-managed via the `hetznercloud/hcloud` provider in the `offsite` environment
(see the off-site handoff note below and ADR-016).
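
The group check described above can be sketched in a few lines of stdlib Python. This is an illustration under stated assumptions, not the real `scripts/tf_to_inventory.py`: the function name `build_inventory` and the returned inventory layout are hypothetical, and only the `{ host: { ip, group } }` shape and the four valid groups come from this ADR.

```python
import json

# Allowed groups, as listed in this ADR.
VALID_GROUPS = {"control", "docker_hosts", "proxmox_hosts", "offsite_hosts"}


def build_inventory(tf_output_json):
    """Turn a `terraform output -json` document into an inventory-shaped dict."""
    # `terraform output -json` wraps each output in an object with a "value" key.
    vms = json.loads(tf_output_json)["vms"]["value"]
    inventory = {}
    for host, spec in sorted(vms.items()):
        group = spec["group"]
        if group not in VALID_GROUPS:
            # Fail loudly instead of silently writing an unknown group.
            raise ValueError(f"unknown group {group!r} for host {host!r}")
        # Assumed layout: group -> hosts -> host -> connection vars.
        inventory.setdefault(group, {"hosts": {}})["hosts"][host] = {
            "ansible_host": spec["ip"]
        }
    return inventory
```

Raising on the first unknown group keeps a typo in `local.vms` from producing an inventory that Ansible would accept but never target.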
The generated `hosts.yml` carries a "do not edit manually" header and is owned by
the generator. Treat it as a build artifact: the source of truth is `local.vms` in
@ -83,7 +94,7 @@ Terraform, and the inventory is regenerated, never edited.
---
## Cloud-init's role
### Cloud-init's role
Cloud-init is the thin first-boot layer between Terraform and Ansible:
@ -98,7 +109,7 @@ The line is sharp: cloud-init buys *reachability*, Ansible owns *configuration*.
---
## Internal DNS — owned by Ansible, no chicken-and-egg
### Internal DNS — owned by Ansible, no chicken-and-egg
Terraform writes **no** DNS records. The internal zone (`boma.baobab.band`) is
rendered entirely by the Ansible `dns` role:
@ -108,7 +119,8 @@ rendered entirely by the Ansible `dns` role:
remains the ultimate source of truth for which hosts exist; the data simply flows
through the inventory instead of through a direct Terraform→DNS write.
- **Service, alias (CNAME), split-horizon, and non-VM records** (e.g. the OPNsense
gateway, `forgejo.nyumbani.baobab.band` → proxy) are explicit zone data in `group_vars`.
gateway, `vaultwarden.wingu.me` → proxy split-horizon) are explicit zone data in
`group_vars`.
This dissolves the bootstrap cycle that a Terraform-managed zone would create. If
Terraform wrote records via RFC 2136, provisioning the **first** DNS server would
@ -124,14 +136,16 @@ convention only — it no longer implies any difference in how records are writt
---
## The control-node exception
### The control-node exception
The control node — the host that runs Terraform and Ansible — is the one VM
Terraform does **not** create. It cannot provision the infrastructure that would
provision itself (chicken-and-egg). It is therefore the single documented exception
to "Terraform owns VM existence":
The control node — the host that runs Terraform and Ansible — is `ubongo`, a
dedicated **physical** machine outside the cluster. It is not a VM at all, so
Terraform genuinely never touches it: it cannot provision the infrastructure that
would provision itself (chicken-and-egg). It is therefore the single documented
exception to "Terraform owns VM existence":
- Provisioned and bootstrapped manually, per the control-node section of ADR-005.
- Provisioned and bootstrapped manually on bare metal, per the control-node section
of ADR-005; rationale, hardware, and recovery model in ADR-015.
- Listed in `inventories/<env>/hosts.yml` under the `control` group, and managed by
Ansible for baseline config only (no `docker_host` role).
@ -139,7 +153,28 @@ Every other host is Terraform-managed.
---
## What was ruled out
### The off-site handoff (`offsite` environment → `offsite_hosts`)
`askari` (Hetzner VPS, ADR-016) follows the same handoff pipeline as Proxmox hosts but
with its own provider and environment:
- **Producer**: `terraform/environments/offsite/outputs.tf` emits a `vms` map in the
same `{ host: { ip, group } }` shape as Proxmox environments; `askari`'s group is
`offsite_hosts`.
- **Consumer**: `scripts/tf_to_inventory.py` reads `terraform output -json` from the
`offsite` environment and writes `inventories/production/offsite.yml`.
- **Makefile target**: `make tf-inventory-offsite` runs the generator for the offsite
environment.
The production inventory is a **directory** (`inventories/production/`) that Ansible
merges at runtime: `hosts.yml` (Proxmox-generated) and `offsite.yml`
(offsite-generated) together form the full production host list. Each file is a build
artifact — never hand-edited; their source of truth is `local.vms` in the respective
environment's `main.tf`.
---
### What was ruled out
| Option | Reason |
|---|---|
@ -147,3 +182,28 @@ Every other host is Terraform-managed.
| Hand-editing the generated inventory | `hosts.yml` is a build artifact of `tf_to_inventory.py`; edits are overwritten on the next `make tf-inventory`. Edit `local.vms` instead. |
| Documenting the seam in both ADR-005 and ADR-006 | The boundary belongs in exactly one place. Those ADRs link here. |
| Terraform-managed DNS records (`hashicorp/dns` + RFC 2136) | Created a bootstrap cycle (the first DNS server can't register itself) and split DNS ownership across two tools. Ansible owns the whole internal zone instead — one owner, no cycle. |
## Consequences
Drawn from the boundary, the data contract, and the "What was ruled out" section above:
- Adding a host means editing `local.vms` and running the handoff pipeline; the
generated `hosts.yml` is a build artifact and must never be hand-edited — manual
edits are overwritten on the next `make tf-inventory` (The handoff pipeline; The
data contract; What was ruled out).
- Manual `qm clone` is rejected as a general provisioning path so the inventory and
real infrastructure cannot drift; Terraform is the single way VMs come into
existence (What was ruled out).
- Terraform writes no DNS records: the Ansible `dns` role renders the whole internal
zone from inventory plus `group_vars`, dissolving the bootstrap cycle a
Terraform-managed zone (`hashicorp/dns` + RFC 2136) would create (Internal DNS —
owned by Ansible, no chicken-and-egg; What was ruled out).
- The control node (`ubongo`) is the single documented exception to "Terraform owns
VM existence" — a physical machine provisioned manually and managed by Ansible for
baseline config only (The control-node exception).
- The `offsite` TF environment's `vms` output feeds the `offsite_hosts` group via
`tf_to_inventory.py` (`make tf-inventory-offsite` → `inventories/production/offsite.yml`);
the production inventory is a directory that merges `hosts.yml` (Proxmox) and
`offsite.yml` (offsite) (The off-site handoff).
- The seam is documented in exactly one place (this ADR); ADR-005 and ADR-006 link
here rather than restating it (What was ruled out).

@ -1,5 +1,9 @@
# ADR-010 — Forgejo integration and CI
## Status
Accepted (2026-05-30)
## Context
boma's git host, container registry, and (planned) CI all run on a self-hosted
@ -20,7 +24,7 @@ held to the same standard as the rest of the repo's secrets.
---
## Decisions
## Decision
### 1. API tokens are managed secrets, least-privilege
@ -63,8 +67,8 @@ Trunk-based, matching ADR-003 / ADR-008:
push to main → lint + Molecule → deploy staging → [manual gate] → deploy production
```
Runner: `act_runner` on the control node or a dedicated runner VM. Actions is not
yet enabled — see STATUS.md.
Runner: `act_runner` on `ubongo` (the control node — ADR-015), or a dedicated runner VM
later if CI load warrants a separate host. Actions is not yet enabled — see STATUS.md.
---
@ -75,3 +79,21 @@ yet enabled — see STATUS.md.
| Terraform Forgejo HTTP state backend | Forgejo's `/raw/` API is read-only; state can't be written there. Local state instead (ADR-006). |
| Admin-scoped automation tokens | Unnecessary privilege; scope to `read:repository` + `read`/`write:package`. |
| Ad-hoc UI/API configuration as the norm | Becomes undocumented drift; codify or document instead. |
---
## Consequences
- The planned CI pipeline (see "CI pipeline (planned)") is trunk-based per ADR-003 /
ADR-008 — `push to main → lint + Molecule → deploy staging → [manual gate] → deploy
production` — running `act_runner` on `ubongo` (or a dedicated runner VM later if CI
load warrants); Actions is not yet enabled, so this remains future work tracked in
STATUS.md.
- Terraform state is not held in Forgejo: its `/raw/` API is read-only and cannot be
written, so local state is used instead (ADR-006) (see "What was ruled out").
- Automation tokens are scoped to `read:repository` + `read`/`write:package` rather
than admin, accepting the limits that least-privilege imposes on what automation can
do (see "What was ruled out").
- Instance/repo configuration must be codified or documented rather than changed
ad-hoc, to avoid the undocumented drift `/review-repo` exists to catch (see "What was
ruled out").

@ -1,6 +1,9 @@
# ADR-011 — Update and upgrade management
**Status: Proposed — draft for discussion (not yet accepted).**
## Status
Proposed (2026-06-04) — draft for discussion; not yet accepted. The core decisions
below are settled in intent, but several specifics remain open (see "Open questions").
## Context
@ -10,7 +13,7 @@ drift over time and must be kept current without breaking the homelab: the **hos
---
## Decisions
## Decision
### 1. Every service is classified stateful or stateless
@ -18,7 +21,7 @@ Each container role declares its class, e.g. `<role>__stateful: true|false` (def
`false`). The split is the load-bearing classification for the whole policy.
- **Stateless** — no durable data of its own; losing the container loses nothing.
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Traefik,
Rebuild = re-pull. Examples: the \*arr stack, Jellyfin, exporters, whoami, Caddy,
reverse proxies, FlareSolverr.
- **Stateful** — owns data, schema, or migrations: databases, and apps with their own
store/migrations (Nextcloud, Vaultwarden, Forgejo, PhotoPrism, Discourse, Snipe-IT).
@ -53,7 +56,7 @@ per host, in strict order with a verification gate between every phase:
5. **Verify** again; alert on failure.
**Host ordering:** infrastructure hosts (DNS, then reverse proxy) update and validate
**before** the rest follow — so a DNS/Traefik failure doesn't make every host look
**before** the rest follow — so a DNS/Caddy failure doesn't make every host look
broken at once and hide the real cause. Never reboot the whole fleet simultaneously.
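
The host ordering rule can be pictured as grouping hosts into ordered waves. This is a minimal sketch: the role labels (`dns`, `proxy`, `app`) and the function name `update_waves` are hypothetical and do not come from the repo.

```python
# Lower rank updates first; roles not listed here fall into the final wave.
ROLE_RANK = {"dns": 0, "proxy": 1}


def update_waves(hosts):
    """Group hosts into ordered waves: DNS, then reverse proxy, then the rest."""
    waves = {}
    for name, role in hosts.items():
        waves.setdefault(ROLE_RANK.get(role, 2), []).append(name)
    # Sort ranks so earlier waves come first; sort names for stable output.
    return [sorted(waves[rank]) for rank in sorted(waves)]


fleet = {"dns1": "dns", "dns2": "dns", "proxy": "proxy", "homeassistant": "app"}
print(update_waves(fleet))
# [['dns1', 'dns2'], ['proxy'], ['homeassistant']]
```

Each wave would be validated before the next one starts, and hosts inside a wave would still be updated one at a time so the fleet is never rebooted simultaneously.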
### 4. Snapshot-before is the rollback mechanism
@ -64,8 +67,8 @@ Because these are primarily Proxmox VMs, take a **VM snapshot before the Friday
### 5. Stateful upgrades — 8-weekly analysis, human-gated, backup-first
Stateful services are **never** touched by the weekly run. Instead, **every 8 weeks**
an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` plan and
ADR-010) does:
an automated analysis job (a scheduled `claude -p`, per the `scheduled_jobs` design in
`docs/TODO.md` 8.3, not yet built) does:
1. Read changelogs / breaking-change notes for each pinned stateful image; diff the
pinned tag against what's available.
@ -125,10 +128,26 @@ alert-driven.
| -------------------------------------- | ----------------------------------------------------------------------------- |
| One uniform policy for all services | Ignores blast radius; stateful data loss ≠ stateless re-pull. |
| Rolling `latest` for stateful services | Unattended schema/migration changes are how you lose data. |
| Digest-pinning the stateful tier | Unreadable in diffs; snapshot-before + backups give the immutability instead. |
| Digest-_only_ pin (no readable tag) for stateful | Unreadable in diffs — the tiered rule pins `tag@digest` (readable tag *and* digest) instead (Decision 2). |
| Pinning the stateless tier | No durable data to protect; pins just add churn DIUN already covers. |
| Auto-updating stateful on a timer | Must be human-gated and backup-first; only the _analysis_ is automated. |
| Updating the whole fleet at once | Simultaneous reboots hide which host/phase actually broke. |
| 8-weekly as the only stateful path | Too slow for urgent CVEs — hence the DIUN security fast-path. |
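
The `tag@digest` rule for the stateful tier lends itself to a simple lint. The sketch below assumes image references are plain strings and that digests are `sha256`; the function names are illustrative and not part of the repo.

```python
import re

DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def is_tag_and_digest(image_ref):
    """True if the reference carries both a readable tag and a sha256 digest."""
    name_and_tag, sep, digest = image_ref.partition("@")
    if not sep or not DIGEST.match(digest):
        return False
    # The tag follows ':' in the final path segment; an earlier ':' may be a registry port.
    last_segment = name_and_tag.rsplit("/", 1)[-1]
    _, colon, tag = last_segment.partition(":")
    return bool(colon and tag)


def violates_policy(image_ref, stateful):
    """Stateful services must be pinned tag@digest; stateless ones are not checked."""
    return stateful and not is_tag_and_digest(image_ref)
```

A digest-only reference is rejected on purpose here: it matches the ruled-out "Digest-only pin" row, where the pin exists but the diff stays unreadable.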
---
## Consequences
- A single uniform update policy is rejected: the stateful/stateless split is
load-bearing, so stateless services roll on rolling tags while stateful services are
pinned `tag@digest`, human-gated, and backup-first (see "What was ruled out").
- The weekly run never touches stateful services and the whole fleet is never updated
at once, accepting the added orchestration of host ordering and an 8-weekly +
fast-path cadence in exchange for bounded blast radius (see "What was ruled out").
- No update automation ships until the health-check verification gate is in order; the
pipeline is deliberately sequenced behind that harness (see Decision 6).
- Several points remain open for discussion (see "Open questions"): where the Proxmox
snapshot is driven from across the TF/Ansible boundary; the exact cadences; where the
health-check harness lives and the minimum bar that counts as "in order"; whether
classification is a per-role `__stateful` flag or a group_vars list; whether the
weekly run hits staging first; and the notification + "skip/pause" control channel.

@ -1,5 +1,9 @@
# ADR-012 — Hardware reference & capacity evaluation
## Status
Accepted (2026-06-01)
## Context
The repo modelled the logical/network layer (Terraform VM specs, ADR-007
@ -13,6 +17,8 @@ workload that should move, or a node due an upgrade.
- `docs/hardware/reference.md` is the single, hand-maintained source of truth for
physical compute + network gear and workload placement intent. Two
machine-readable tables (node capacity, workload placement) carry the numbers.
This includes `ubongo`, the physical control node (ADR-015), even though it sits
outside the Proxmox cluster.
- `scripts/capacity-scan.py` (stdlib-only, like `repo-scan.py` / `tf_to_inventory.py`)
parses those tables, computes per-node allocated-vs-physical rollups, and
cross-checks workload hostnames against `terraform output -json` /
@ -34,5 +40,11 @@ workload that should move, or a node due an upgrade.
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
matching `capacity-scan.py` change.
- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention
budget and `askari`'s security-subset volume belong in `reference.md`, and SSD
**wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
not assumed.
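
The allocated-vs-physical rollup that `capacity-scan.py` is described as computing can be pictured as follows. This is only a sketch: the field names (`cores`, `ram_gb`, `node`) and the function name `rollup` are assumptions, since the actual `reference.md` table headers are not shown here.

```python
from collections import defaultdict


def rollup(nodes, workloads):
    """Sum workload allocations per node and compare them with physical capacity."""
    allocated = defaultdict(lambda: {"cores": 0, "ram_gb": 0})
    for w in workloads:
        allocated[w["node"]]["cores"] += w["cores"]
        allocated[w["node"]]["ram_gb"] += w["ram_gb"]
    report = {}
    for name, phys in nodes.items():
        used = allocated[name]
        report[name] = {
            # Each pair is (allocated, physical).
            "cores": (used["cores"], phys["cores"]),
            "ram_gb": (used["ram_gb"], phys["ram_gb"]),
            # RAM is flagged because it is usually the hard limit; cores are often oversubscribed.
            "overcommitted": used["ram_gb"] > phys["ram_gb"],
        }
    return report
```

Because the numbers come from hand-maintained tables, a report built this way reflects placement intent and not measured usage, which matches the caveat in the consequences above.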
See also: ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).
## Related
ADR-001 (architecture), ADR-007 (network), ADR-009 (TF ↔ Ansible handoff).

@ -1,5 +1,9 @@
# ADR-013 — Heritage: learning from AnsibleBaobabV4 without inheriting it
## Status
Accepted (2026-06-04)
## Context
boma is the methodology successor to AnsibleBaobabV4 (and V3 before it) — not a new
@ -10,7 +14,9 @@ structure and assumptions creep back in under the guise of "inspiration." This A
sets the policy for drawing on V4 without inheriting it. (Resolves the questions
previously parked in TODO 3.3 and 10.1.)
## Principle — translate, don't transplant
## Decision
### Principle — translate, don't transplant
V4 is **evidence, never authority.** It can show what was needed or what went wrong;
it can never be the reason boma does something a certain way.
@ -21,7 +27,7 @@ it can never be the reason boma does something a certain way.
- **Acceptance test** for anything V4-derived: *can it be justified purely from
boma's principles, with zero reference to V4?* If not, it does not land.
## What V4 is — and is not — a source of
### What V4 is — and is not — a source of
| Legitimate source of | Never a source of |
|---|---|
@ -33,7 +39,7 @@ it can never be the reason boma does something a certain way.
Only concrete, verifiable, low-level knowledge crosses over — precisely because it is
safe to re-derive, whereas structure and requirements drag assumptions along.
## Provenance — transient only
### Provenance — transient only
When a boma decision was prompted by a V4 lesson, or a config adapted from V4, the
lineage is recorded only in **transient** places: the commit message, the working
@ -42,7 +48,7 @@ extraction warrants one. **Durable artifacts (ADRs, role READMEs, `SECURITY.md`)
stand on boma's own terms with no V4 reference.** Honest about lineage in history;
clean in the living repo.
## AI consultation guardrails
### AI consultation guardrails
The AI is the main consumer of V4 — it is on disk and readable. When consulting it:
@ -68,5 +74,7 @@ copy.
cost of a clean methodological break.
- The policy is enforceable in review and by the AI guardrails above.
See also: ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
## Related
ADR-001 (architecture / legibility), ADR-004 (service-role model), ADR-011
(update management — ntfy topics decided fresh per this policy).

@ -1,5 +1,9 @@
# ADR-014 — Sourcing technical knowledge (docs and best practices)
## Status
Accepted (2026-06-04)
## Context
Most work in boma is done by AI agents drawing on training memory, which is stale
@ -85,10 +89,11 @@ The accelerators this policy prefers (`context7`, `deep-research`, `superpowers`
`claude-code-guide`) are **plugins under `~/.claude/`** — local per machine, **not**
synced by Claude account and **not** carried by the git repo (only `.claude/commands`,
`.claude/hooks`, `.claude/settings.json` travel). A fresh clone therefore lacks the
plugin toolchain until it is reinstalled. Making it reproducible from the repo
(`extraKnownMarketplaces` + `enabledPlugins` in `.claude/settings.json`, plus a
bootstrap step) is tracked in `docs/TODO.md` and tied to control-node/AI setup. Until
then, the graceful-degradation fallback above keeps the policy working.
plugin toolchain until it is reinstalled. Making it reproducible from the repo is
**done** (TODO 10.7): `.claude/settings.json` declares `extraKnownMarketplaces` +
`enabledPlugins`, and `docs/runbooks/claude-code-setup.md` documents the per-machine
bootstrap. Until a fresh clone runs that bootstrap, the graceful-degradation fallback
above keeps the policy working.
## Decision
@ -99,5 +104,27 @@ then, the graceful-degradation fallback above keeps the policy working.
- Commit to the principle, not a tool — degrade to `WebFetch`/`WebSearch` when plugins
are absent.
See also: ADR-013 (heritage / translate-don't-transplant), ADR-011 (version pinning),
ADR-008 (testing/verification).
## Consequences
Drawn from the follow-on work and limitations this ADR already states:
- Verified facts carry a durable, greppable stamp; a stamp binds a fact to a pinned
version, so a `requirements` change or image upgrade marks exactly what to re-check
(per Capture / Re-verification).
- Stale-stamp detection — a `/review-repo` or `/security-review` check flagging stamps
whose recorded version no longer matches what is pinned — is a noted enhancement, not
built yet (per Re-verification).
- Any version-specific claim given from memory must be marked "from memory, unverified"
as a transparency backstop, since agent self-assessed certainty is unreliable (per
When consulting is required).
- The policy commits to the principle rather than a specific plugin, so it degrades to
`WebFetch`/`WebSearch` on a bare install; reproducing the plugin toolchain from the
repo is done via `.claude/settings.json` and `docs/runbooks/claude-code-setup.md`,
with the graceful-degradation fallback covering a fresh clone until bootstrap runs
(per Source hierarchy / Reproducibility of the toolchain).
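The stale-stamp check noted above is not built yet. As a minimal sketch of what it might look like, the following assumes a stamp sits on one line beginning with `verified:`, that its fields are separated by ` · `, and that one field starts with a tool name followed by a dotted version. The field layout, the `pinned` mapping, and the function name are assumptions for illustration, not an existing boma convention.

```python
import re

# Assumed field shape: "<tool> <dotted version> ..." (e.g. "rbw 1.15.0 on ubongo").
VERSION_FIELD = re.compile(r"^(?P<tool>[A-Za-z][\w-]*) (?P<version>\d+(?:\.\d+)+)\b")


def stale_stamps(text, pinned):
    """Return (line_no, tool, stamped, pinned) for stamps whose version drifted."""
    stale = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        body = line.lstrip("> ").strip()  # tolerate blockquoted stamps
        if not body.startswith("verified:"):
            continue
        for field in body.split(" · "):
            match = VERSION_FIELD.match(field.strip())
            if match is None:
                continue
            tool, stamped = match["tool"], match["version"]
            # Only flag tools we actually pin; unknown tools are ignored.
            if tool in pinned and pinned[tool] != stamped:
                stale.append((line_no, tool, stamped, pinned[tool]))
    return stale
```

A review command could feed this each tracked Markdown file plus the versions currently pinned, and report every returned tuple as a fact to re-verify.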
## Related
- ADR-013 — heritage / translate-don't-transplant.
- ADR-011 — version pinning.
- ADR-008 — testing / verification.


@ -0,0 +1,192 @@
# ADR-015 — Control / development / AI-worker host (`ubongo`)
## Status
Accepted (2026-06-05). **Amended 2026-06-18:** the `claude` AI-worker account now has
`NOPASSWD:ALL` sudo on `ubongo` — reversing the original "no local sudo" sub-decision.
The amendment is recorded in §Access & security below; rationale and accepted risk are
in ADR-021 and `docs/security/accepted-risks.md` (R7).
## Context
Earlier ADRs framed the control node — the host that runs Terraform and Ansible —
as a **single Debian 13 VM on the Proxmox cluster**, manually provisioned as the one
documented exception to "Terraform owns VM existence" (ADR-009). That framing treats
the control node purely as a control-plane runner.
It fails four needs, all confirmed as drivers:
1. **Cold-start bootstrap** — the VM that runs Terraform/Ansible cannot exist until
something else creates it; the bootstrap is circular and awkward.
2. **Always-on availability** — the operator wants to SSH in from a work PC or
anywhere to drive Claude Code. A cluster VM is gone whenever the cluster is down
or being rebuilt.
3. **Recovery / disaster** — the tool used to rebuild the cluster must not live
inside the thing it rebuilds.
4. **Dev ergonomics** — a persistent home for Claude Code + the repo, not entangled
with production VM lifecycle.
A laptop-only answer fails always-on and recovery. A VM-only answer fails cold-start
and recovery. A small dedicated always-on physical machine outside the cluster
satisfies all four.
## Decision
Introduce **`ubongo`** (Swahili: *brain*, consistent with the fleet's theme): a
single dedicated x86-64 mini-PC, always-on, living **outside** the Proxmox cluster.
It becomes *the* control node and collapses four roles into one box:
- Terraform + Ansible runner (control plane)
- Claude Code / AI-worker host the operator SSHes into
- Local test runner (Molecule/Docker, lint, and later a browser stack)
- Persistent dev home for the repo
There is **no longer a control VM on the cluster.** The `control` inventory group
points at this physical box. This *strengthens* the ADR-009 control-node exception:
it is genuinely outside Terraform's world, not a VM pretending to be the exception.
Every other host stays a Terraform-managed VM exactly as designed.
`ubongo` runs **plain Debian 13** (the `base` role applies). It is not a production
hypervisor and runs no `docker_host` services. It does run **ephemeral KVM test VMs**
as part of its local-test-runner role (ADR-025 — local VM integration testing): one
throwaway VM at a time (~3 GiB RAM), against ~13 GiB free of the 16 GiB sized here.
This is not a production workload — it is the concrete implementation of ADR-008 Level
2/3, and the resource guard enforces one-at-a-time to stay within the RAM ceiling.
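The one-at-a-time rule can be modelled in a few lines. This is only a sketch of the admission logic under the figures given above (about 3 GiB per VM); the real guard is specified in ADR-025, and the function and parameter names here are hypothetical.

```python
def may_start_test_vm(running_vms, free_ram_gib, vm_ram_gib=3.0):
    """Allow a new throwaway VM only if none is running and RAM suffices."""
    if running_vms > 0:
        return False  # one-at-a-time: never stack test VMs
    return free_ram_gib >= vm_ram_gib
```

With roughly 13 GiB free, a single VM is admitted, and a second request is refused until the first is torn down.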
### Hardware target
| Spec | Target | Why |
|---|---|---|
| CPU | 4 cores, x86-64 (Intel N100-class or better) | Molecule containers + Chromium prefer x86 |
| RAM | 16 GB | Docker + headless Chromium + toolchain headroom |
| Disk | 250 GB SSD/NVMe | Docker images, molecule layers, repos, browser cache |
| Network | Wired GbE | Always-on reliability over Wi-Fi |
| Power | Low draw (≤15 W idle) | Runs 24/7 |
Indicative: a refurb Dell/Lenovo/HP micro (USFF) or an N100 mini-PC (~€150–250).
Claude Code itself is light (the model runs in Anthropic's cloud); the sizing driver
is **all testing being local** — Molecule (Docker), lint, and a future
headless-Chromium/Playwright stack.
### Provisioning (bootstrap path)
Manual, on bare metal:
1. Install Debian 13 on the box (one-time, by hand).
2. `git clone` the repo; `make setup`; `make collections`; set up `rbw` + unlock.
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
4. From then on `ubongo` manages every other host normally; Ansible manages *it* for
baseline config via the `control` group (`base` role only).
### Access & security
- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
SSH to `ubongo` over the mesh; nothing is published to the public internet — this
stays inside ADR-002.
- `ubongo` runs the `base` role: SSH hardening, nftables default-deny, fail2ban,
auditd, unattended-upgrades. Inbound SSH is allowed **only on the mesh interface**,
denied on the physical NIC.
- **Operational reality (until the mesh exists):** the "SSH only on the mesh interface"
target above is the end state, not yet in force. Today remote access is **LAN SSH
only** — key-only, with password auth and root login disabled — until the NetBird mesh
(ADR-016) is stood up.
- **AI-worker identity:** `ubongo` runs the AI worker under a dedicated,
password-locked `claude` user (in the `docker` and `libvirt` groups; **`NOPASSWD:ALL`
sudo** via a repo-managed drop-in — see amendment below). It is reached via `sudo -iu
claude` or its own SSH key. The rationale is **attribution + revocation, not
containment**: auditd/Loki (ADR-018) can separate human from agent actions, and the
account/key can be revoked without touching the operator's access. (ADR-021 left the
on-`ubongo` agent identity unspecified; this records it.)
**Amendment (2026-06-18) — `claude` now has `NOPASSWD:ALL` sudo.**
> **Superseded by [ADR-025](025-local-vm-integration-testing.md)** (per ADR-023 §4): the
> "no local sudo" sub-decision is reversed. The shakedown that necessitated it is ADR-025;
> the resulting two-account access model is ADR-021; the accepted risk is R7.
During the
integration-testing harness shakedown, the original "no local sudo" sub-decision was
reversed. No-sudo blocked the AI-worker from diagnosing a failed VM: `virsh`,
`virt-install`, `cloud-localds`, `journalctl`, `nft` — nearly all low-level
diagnostic commands — require root. The AI-worker must autonomously spin up,
inspect, and tear down test VMs without operator hand-holding; that is the harness's
core value proposition. Compensating controls make the risk acceptable:
1. `claude`'s password is **locked** (no interactive login, no `su claude` without the
operator's own credentials) — `NOPASSWD` sudo is the *only* sudo path.
2. `auditd` + Loki attribution (ADR-018) separates human from agent root actions.
3. The drop-in is **repo-managed** via `base__ai_worker_user` — revocable in one commit
and one deploy.
4. Single-operator homelab: everything in git, off-machine backups (ADR-022).
The operator (`sjat`) uses **password-required sudo** via the `sudo` group; their
former `NOPASSWD` drop-in was removed 2026-06-18 as redundant once `claude` had sudo
(least-privilege cleanup). The accepted risk is registered as R7 in
`docs/security/accepted-risks.md`. ADR-021 records the resulting sudo model for both
accounts.
- **Disk encryption:** `ubongo`'s SSD is **not encrypted at rest** — the SanDisk X600 is
TCG-Opal-capable but Opal is unused. This is an accepted risk recorded in
`docs/security/accepted-risks.md` (control-node disk not encrypted at rest),
compensated by physical security, a BIOS supervisor password, and disabled
external/USB boot.
### Recovery model
`ubongo` is the rebuild tool, so three things must survive a full cluster loss:
1. **`mamba` (laptop) is a break-glass clone** — repo + toolchain + mesh + `rbw`,
able to drive the fleet if `ubongo` dies.
2. **Terraform state** lives on `ubongo`, backed up encrypted off-box (synced to
`mamba`). For a 25 VM fleet it is also reconstructable via `terraform import`.
3. **Vault password:** `ubongo` gets it from Vaultwarden via `rbw`. `rbw` keeps a
local encrypted copy of the vault and decrypts it offline with the operator's
Vaultwarden master password, so `ubongo` can decrypt the Ansible vault with the
whole cluster down — provided `rbw` has synced once and the operator keeps the
Vaultwarden master password offline (memorised + paper in a safe). Mirror onto
`mamba`.
There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Mirroring Vaultwarden onto `ubongo` is rejected: it
would make the control node run a service (against its remit) and still need that
master password.
> verified: rbw offline-cache decryption · rbw 1.15.0 on ubongo · with the Vaultwarden
> host blocked, `rbw sync` failed but `rbw get` decrypted the cached vault offline ·
> 2026-06-11
## Consequences
- The control node is physical compute outside the cluster, so it appears in
`docs/hardware/reference.md` even though it is not a cluster node (ADR-012).
- All testing (Molecule, lint, staging/external) runs on `ubongo` (ADR-008).
- A future **service-UI acceptance** testing level (Claude driving a headless browser
against a deployed service) is anticipated; `ubongo` is sized for it. The harness
is a separate spec.
## Deferred (separate specs / discussions)
1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
(off-site, so it survives a homelab outage and stays out of the cluster it
administers). Replaces ADR-007's OPNsense WireGuard.
2. **Browser-E2E verification harness — RESOLVED (ADR-017):** Claude-driven
exploratory service-UI verification (`/verify-service`, ADR-008 Level 4), against
staging with test users in Authentik. Design + skill + standards complete; running
deferred on the stack.
3. **`rbw` offline-cache verification — RESOLVED (2026-06-11 build):** confirmed offline
cache decryption on rbw 1.15.0 — `rbw sync` fails with Vaultwarden unreachable while
`rbw get` still decrypts from the local cache (ADR-014).
## What was ruled out
| Option | Reason |
|---|---|
| Keep control node as a cluster VM | Fails cold-start, recovery, always-on. |
| Laptop-only (`mamba` for everything) | Fails always-on. Retained as break-glass backup. |
| Split roles (control VM + thin jump box) | Two toolchains, split control plane, heavy testing back on a cluster VM. |
| Mirror Vaultwarden onto `ubongo` | Control node would run a service; still needs the master password. |
| Self-hosted mesh coordinator on the cluster | Recreates the chicken-and-egg. |
| Raspberry Pi | Chokes running Docker + Chromium + toolchain together. |
## Related
ADR-001 (architecture), ADR-005 (bootstrapping), ADR-008 (testing),
ADR-009 (provisioning handoff), ADR-012 (hardware/capacity), ADR-002 (security).


@ -0,0 +1,166 @@
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
## Status
Accepted (2026-06-05). Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.
## Context
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
weigh. This ADR settles it.
## Decision
A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.
The decision in five parts:
1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
survives a homelab outage and stays out of the cluster it administers.
3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
self-host). Tailscale would mean Headscale (third-party reimplementation, partial
parity) — ruled out below.
4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (25
hosts) the "agent everywhere" cost is trivial and the `base` role already runs
everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
(`mgmt`/gateway reached by a single advertised route or LAN-side admin).
5. **Identity — embedded local users** (Dex in the management container); external SSO
(Zitadel/Keycloak) stays an optional future.
## Verified facts (ADR-014)
verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
— components management+signal+dashboard+relay/TURN(Coturn), **single container since
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.
verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
open-core feature gating.
## Architecture
Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
allocated for it.
- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
- `ubongo` — agent.
- All Linux managed hosts — agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
- OPNsense / `mgmt` — single non-agent exception.
## Security
- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
privilege.
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
consumed by `base`; prefer ephemeral/scoped keys.
- **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
(primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
explicit the control-node SSH allow that the recovery model already implied; the access
doctrine and the three-tier access ladder live in **ADR-021**.
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
Recorded as accepted-risk R3.
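The host-firewall rule in this section reduces to a two-path decision. The sketch below models that decision only; enforcement is done by nftables in `base`, not by Python. The function name is hypothetical, and any addresses passed in are placeholders rather than boma's real LAN addressing.

```python
import ipaddress


def ssh_allowed(in_iface, src_ip, ubongo_lan_ip):
    """Model the inbound-SSH rule: mesh interface, or ubongo's LAN address."""
    if in_iface == "wt0":
        return True  # primary path: WireGuard-authenticated mesh peer
    # Secondary, mesh-independent path: only the control node's LAN address.
    return ipaddress.ip_address(src_ip) == ipaddress.ip_address(ubongo_lan_ip)
```

Every other LAN source falls through to `False`, which corresponds to the default-deny stance described above.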
## Recovery & operations
- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
mesh/coordinator outage never blocks on-LAN runs.
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
`base` enrolls the fleet.
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
NetBird's management datastore is **intended** to be backed up encrypted off `askari`
(synced to `ubongo`/`mamba`; not yet built — see the Availability amendment / R8); peers
keep last-known config through a brief coordinator outage.
- **`askari` is Ansible-managed:** its own inventory group `offsite_hosts` — provisioned
as **Terraform IaC** (`hetznercloud/hcloud`), managed independently of the Proxmox
cluster (its own provider + local state). Ansible configuration: `base` role, plus a
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
version-pinned (ADR-011). boma's `dns` role stays authoritative for
`boma.baobab.band`; NetBird built-in DNS scoped/off.
## What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
## Consequences
- A new public surface appears on `askari` — management API + dashboard (80/443) +
Coturn (3478) — mitigated by TLS, embedded-IdP login, source-IP limits where
practical, `base` hardening and version-pinned NetBird, and recorded as accepted-risk
R3 (Security).
- On-LAN SSH never depends on the mesh: `base` allows inbound SSH from `ubongo`'s LAN
address as a mesh-independent secondary path, so a mesh/coordinator outage never
blocks on-LAN SSH and Ansible stays off the mesh (Security; Recovery & operations).
- The mesh survives a homelab outage because the coordinator is off-site on `askari`,
with its management datastore **intended** to be backed up encrypted off `askari` (not yet built — see the Availability amendment / R8) and peers keeping
last-known config through a brief coordinator outage (Recovery & operations).
- Choosing NetBird over plain OPNsense WireGuard, Tailscale, Tailscale+Headscale, an
on-cluster coordinator, a `ubongo` subnet router, and a standalone IdP gains
identity/ACL policy, self-hosted sovereignty, no routing SPOF, and a light single
operator footprint (What was ruled out).
- Implementation is pending: the role tasks land only once the unbuilt `base` role and
service-role machinery exist (Status).
## Availability — an `askari` outage (amendment 2026-06-20)
The coordinator is deliberately **single** (one off-site host). Recorded here so its
availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).
The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
radius**:
| Traffic | `askari` down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |
Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
operations, above).
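Why the blast radius is narrow follows from the routing statement above: only destinations inside the overlay CIDR leave via `wt0`. A minimal sketch of that selection, using the CIDR quoted in this section (the function name and the `"default"` label are illustrative):

```python
import ipaddress

# Overlay CIDR as stated in this amendment.
OVERLAY = ipaddress.ip_network("100.99.0.0/16")


def egress_interface(dest_ip):
    """Return "wt0" for overlay destinations, else the host's default route."""
    return "wt0" if ipaddress.ip_address(dest_ip) in OVERLAY else "default"
```

LAN and internet destinations never resolve to `wt0`, so they are untouched by a coordinator outage.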
**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
hosts get the same pin via `base__mesh_coordinator_pin`.
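The coordinator-FQDN pin is, in effect, an idempotent single-line edit of `/etc/hosts`. The sketch below shows that idea on a string, with the match anchored at both ends so a line for a different name (for example a subdomain of the FQDN) is left alone. It is not the `base` role's actual task; the function name and all example values are hypothetical.

```python
import re


def pin_host(hosts_text, ip, fqdn):
    """Return hosts_text containing exactly one "ip fqdn" line (idempotent)."""
    # Anchored: matches only "<addr> <fqdn>" lines, not "<addr> sub.<fqdn>".
    pattern = re.compile(r"^\s*\S+\s+" + re.escape(fqdn) + r"\s*$")
    kept = [line for line in hosts_text.splitlines() if not pattern.match(line)]
    kept.append(f"{ip} {fqdn}")  # the pinned entry always ends up last
    return "\n".join(kept) + "\n"
```

Running it twice yields the same text as running it once, which is the property a configuration-management task needs.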
**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
default-deny posture; only helps established sessions), a second relay (needs another public
host / reintroduces the home public surface), a second coordinator (unsupported by
self-hosted NetBird; against this ADR).
## Related
ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted),
ADR-021 (operational access; SSH ladder reconciling `wt0` + `ubongo`'s LAN address).


@ -0,0 +1,112 @@
# ADR-017 — Service-UI acceptance verification (Level 4)
## Status
Accepted (2026-06-05). Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
## Context
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
Level 4 stub. Nothing below Level 4 exercises a service's **application UI** — none
answer "does PhotoPrism actually let me log in, upload a photo, and see a thumbnail?"
(TODO 8.2). The operator's ask (TODO 2.2 headless browsing + TODO 2.3 test users +
manual-test instruction): Claude spins up a browser, *sees* the service UI, exercises
it, generates test users, and instructs the operator on manual tests. Today Claude sees
a browser only passively (`/screenshot` fetches operator-taken shots from `mamba`); this
is the active counterpart.
## Decision
A Claude-driven exploratory service-UI verification harness — **Level 4** — invoked as
`/verify-service <name>` on `ubongo`. Five settled forks:
1. **Claude-driven exploratory** — Claude navigates with judgment, not deterministic
scripts. A scripted regression suite is explicitly not built here.
2. **Interactive, Claude-in-the-loop** — exploratory judgment can't be a headless cron
gate; scheduled smoke is a determinism job for health checks / Uptime Kuma later.
3. **Staging, full exercise** — Claude creates test users and exercises features
(incl. destructive flows) against a *staging* deploy; the rebuildable sandbox
resolves safety.
4. **Test users in Authentik (central IdP), real SSO flow** — authenticates through
Caddy (ADR-024) + Authentik as a real user would.
5. **Per-service `VERIFY.md` backbone + free exploration** — each service role ships an
acceptance spec of critical journeys; Claude executes it and explores beyond it.
## VERIFY.md standard
Every service role ships a populated `roles/<service>/VERIFY.md`, copied from
`docs/testing/service-verify-template.md` — parallel to `SECURITY.md` from
`service-security-template.md`. A new role convention. It lists the service's critical
user journeys (what "working" means), what good looks like, and what is not
browser-verifiable (→ manual handoff). It also joins the pre-production gate in
`docs/security/service-checklist.md`.
## Test-user standard (TODO 2.3)
Test identities live only in the **staging** Authentik (never production): a dedicated
`test` group / naming prefix; ephemeral per-run credentials (staging is rebuildable, so
nothing persisted, none in `vault.yml`); reuse-or-create; teardown via staging rebuild
or explicit `test`-group cleanup.
## Reporting & manual handoff
`/verify-service` writes `docs/testing/reviews/YYYY-MM-DD-<service>.md` (+ `latest.md`),
mirroring `/review-repo` and `/capacity-review`: pass/fail per `VERIFY.md` journey,
observations, the test-user/env used, a verdict, and a structured **manual-test
checklist** for anything Claude can't do (physical device, paid/external flow,
subjective judgment) — the "instruct me on tests" output. Screenshots are saved to a
git-ignored working dir on `ubongo` (PNG bloat + secret-leak risk); the report links
them.
## Safety
- **Staging-only guard** — the skill refuses to run against production (exploratory
clicking is destructive); ADR-002-aligned hard stop.
- **Confined blast radius** — test users only in the staging `test` group; the run
sticks to the target service.
- **No secrets leaked** — the git-ignored screenshot dir is the safety boundary;
avoid capturing credential screens.
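The staging-only guard and the report location can be sketched together. This assumes the skill is told which environment it targets as a plain string; how `/verify-service` actually determines that is not specified here, and the names `plan_run` and `ProductionTargetError` are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


class ProductionTargetError(RuntimeError):
    """Raised when a run is requested against anything but staging."""


def plan_run(service, environment, run_date):
    """Refuse non-staging targets; otherwise return the dated report path."""
    if environment != "staging":
        raise ProductionTargetError(
            f"refusing to verify {service!r} on {environment!r}"
        )
    # Report naming follows docs/testing/reviews/YYYY-MM-DD-<service>.md.
    return PurePosixPath("docs/testing/reviews") / f"{run_date.isoformat()}-{service}.md"
```

Failing before any browser session starts keeps the hard stop independent of whatever the exploratory run would do next.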
## Dependencies
- `ubongo` (ADR-015) — runs the browser. Designed, not built.
- `playwright` Claude Code plugin — enabled when this lands (`claude-code-setup.md`).
- Authentik (CAPABILITIES §2, planned) — central IdP for test users + SSO.
- A staging deploy of the service (ADR-008 Level 2) — staging is currently empty stubs.
- `make new-role` scaffolding `VERIFY.md` — deferred to when that scaffold is next touched.
## What was ruled out
| Option | Reason |
|---|---|
| Scripted Playwright regression suite | Operator wants exploratory judgment; scripts add maintenance burden. Could be a later layer, not this. |
| Scheduled headless smoke gate | Needs determinism the exploratory nature excludes; belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. |
| Free-form, no per-service spec | Non-repeatable, can miss a critical flow; `VERIFY.md` gives a backbone. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Caddy+Authentik path; central test users are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`. |
## Consequences
- The harness is confined to staging by a hard stop: it refuses to run against
production because exploratory clicking is destructive, the blast radius is bounded to
the target service, and test users live only in the staging `test` group (Safety).
- No secrets leak: the git-ignored screenshot dir is the safety boundary and credential
screens are avoided (Safety; Reporting & manual handoff).
- Test identities are ephemeral per-run credentials in the staging Authentik only —
never production, none persisted in `vault.yml` — created reuse-or-create and torn
down via staging rebuild or `test`-group cleanup (Test-user standard).
- Anything Claude cannot exercise (physical device, paid/external flow, subjective
judgment) is handed off via a structured manual-test checklist in the run report
(Reporting & manual handoff).
- Authoring is possible now (this ADR, the `VERIFY.md` template, the `/verify-service`
skill, conventions/checklist edits), but running is deferred on its dependencies:
`ubongo`, the `playwright` plugin, Authentik, a staging deploy, and `make new-role`
scaffolding `VERIFY.md` (Status; Dependencies).
## Related
ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).


@ -0,0 +1,124 @@
# ADR-018 — Logging and log integrity
## Status
Accepted (2026-06-06). Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Context
boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
off-site watchdog). Undecided: the architecture and the **integrity** question — an
attacker who roots a host will try to clear logs to cover their tracks.
The framing insight: the biggest anti-tampering win is that logs **leave the host in
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
local copy is futile. How far to harden the central store is set by the threat model.
## Decision
1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
forensic-grade.
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
trends. Near-real-time shipping already defeats per-host track-covering.
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only**:
tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
proportionate control.
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
retention + wearout monitoring.
## Architecture
- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
+ container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
single-binary mode; NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
only, write-only, long retention, tiny volume.
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
+ the alerting ADR-002 calls for.
## Data flow & the security subset
Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
it syslog-forwards its alerts to the ingest point), and key container security events.
**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.
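The fan-out described in this section can be modelled as a function from log source to destinations. This is a plain-Python illustration of the routing rule, not Alloy configuration; the destination labels are hypothetical, and "key container security events" are left out because the text does not enumerate them.

```python
# Security sources named in this section (lower-cased for matching).
SECURITY_SOURCES = frozenset({"auditd", "authpriv", "fail2ban", "aide", "suricata"})


def destinations(source):
    """Everything goes to the cluster Loki; security sources also go off-site."""
    targets = ["cluster-loki"]
    if source in SECURITY_SOURCES:
        targets.append("askari-loki")  # filtered, append-only copy
    return targets
```

Because the off-site copy is additive, dropping the `askari` destination during an outage never removes a line from the cluster store.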
## Security, integrity & residual risks
Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
(append-only, off-cluster). The security trail survives full-cluster compromise.
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
shipping but not alter shipped history; **a host going silent is itself an alert**; a
stolen push credential appends noise but can't delete; an `askari` outage buffers +
flushes on reconnect.
## Retention & disk-wear
Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
bounded hot retention (~30–90 days). `askari` subset: long (~1 year+, ~5–25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted `auditd`
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).
## Dependencies
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).
## What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
## Consequences
- Opportunistic track-covering and host-pivot-to-store are defeated because logs leave
the host in near-real-time and the off-cluster security trail is append-only, so it
survives full-cluster compromise (Security, integrity & residual risks).
- Conscious residuals remain: append-only is not cryptographic WORM (root-on-`askari`
could edit chunks — R4); there is a few-seconds un-shipped window; agent compromise
can stop future shipping but not alter shipped history; a stolen push credential
appends noise but cannot delete; and an `askari` outage buffers then flushes on
reconnect (Security, integrity & residual risks).
- A host going silent is itself an alert (Security, integrity & residual risks).
- Only a bounded security subset ships off-site — `auditd`, `authpriv`, `fail2ban`,
AIDE, Suricata and key container security events tagged `security="true"` — while the
cluster Loki holds everything, keeping off-site volume small (Data flow & the security
subset).
- Disk-wear is a managed parameter: log storage on NVMe/SSD or HDD never SD/USB flash,
bounded verbosity at source, tuned Loki retention/compaction, and monitored SSD
wearout/TBW with an alert; log storage is a tracked allocation in
`docs/hardware/reference.md` (Retention & disk-wear).
- The decision is authorable now but the live pipeline is deferred on the stack:
Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog config, and the
push-only credential (Status; Dependencies).
## Related
ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).

@ -0,0 +1,113 @@
# ADR-019 — Tagging standard for targeted, predictable runs
## Status
Accepted (2026-06-06). Resolves TODO 3.7 ("Define a tagging standard that lets us
target runs without over-tagging") and TODO 3.11 ("Deliberate tagging strategy").
## Context
boma wants to run playbooks **targeted** — a single service, a single layer, or a
single cross-cutting concern — **transparently and predictably**: a reader should
know from a `--tags` invocation exactly what it will and won't touch. CLAUDE.md
already requires tag-filterable tasks, but no vocabulary or convention existed, and
the TODO explicitly warns against the opposite failure mode: **over-tagging**.
## Decision
### Two-tier tagging
**Tier 1 — role/service tag (mechanical).** The tag equals the role name, applied
once at the role-import level:
```yaml
roles:
  - role: photoprism
    tags: [photoprism]
```
Ansible propagates it to every task in the role. Because one service = one role
(ADR-004), this single rule covers both the *layer/role* and *single-service*
targeting axes with zero per-task burden. Role-less lifecycle playbooks
(e.g. `bootstrap.yml`) carry a single playbook-identity tag instead.
**Tier 2 — concern tag (curated).** A small **closed list** of cross-cutting concern
tags, applied per-task/block **only where a task genuinely belongs to that concern**.
### The closed concern list
A concern earns a tag only if it (a) appears in 2+ roles, (b) is worth running as a
slice on its own, and (c) doesn't overlap confusingly with another.
| Tag | Covers |
|-----|--------|
| `packages` | apt package install/management |
| `users` | accounts, groups, sudo |
| `firewall` | nftables rulesets & port definitions (ADR-002) |
| `hardening` | security baseline — sshd config, fail2ban, auditd, sysctl |
| `logging` | Alloy / log-shipping config (ADR-018) |
| `monitoring` | metric exporters / health checks |
| `config` | render templated config/compose files to disk — **no restart** |
| `deploy` | bring services up / restart (`compose up -d`) |
| `proxy` | reverse-proxy + TLS registration (Caddy routes, Authentik) |
The `config`/`deploy` split lets you re-render and diff configuration (`--tags
config`) without bouncing services, then restart deliberately (`--tags deploy`).
`backup` and `secrets` are intentionally omitted until the roles needing them exist.
### `always` / `never`
- **`always`** — reserved for cheap preflight assertions (vault unlocked, OS is
Debian 13, required vars present), so even `--tags config` runs its safety guards.
- **`never`** — reserved for destructive/expensive opt-in tasks, each paired with a
descriptive tag (e.g. `tags: [never, force_pull]`); they run only when named.
### Predictability principle: tags are union-only
`--tags a,b` runs tasks tagged a **OR** b — Ansible has no native AND. boma therefore
targets **one axis at a time**: either a role/service *or* a concern, never an
intersection like "photoprism's firewall only." If that's ever needed, just run
`--tags photoprism` (idempotent and fast). Designing for intersection is the
over-tagging trap; we decline it on purpose.
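The selection rules above (union-only, plus `always` and `never`) can be modelled in a few lines. This is a simplified sketch, not Ansible's implementation: it ignores `--skip-tags` and the special `all` / `tagged` / `untagged` keywords, and the task names and the `is_selected` helper are made up for illustration.

```python
def is_selected(task_tags, requested):
    """Return True if a task would run for `--tags <requested>`.

    Simplified model (assumption): no --skip-tags, no special keywords.
    """
    task, req = set(task_tags), set(requested)
    if "never" in task:
        # Opt-in tasks run only when one of their tags is named explicitly.
        return bool(task & req)
    if "always" in task:
        # Preflight guards run for every tag selection.
        return True
    # Union semantics: any overlap selects the task; there is no AND.
    return bool(task & req)


# Hypothetical tasks with their effective (role + concern) tags.
tasks = {
    "assert vault unlocked": ["always"],
    "render compose file": ["photoprism", "config"],
    "open proxy port": ["photoprism", "firewall"],
    "nftables base ruleset": ["base", "firewall"],
    "force image pull": ["never", "force_pull"],
}

for requested in (["photoprism"], ["firewall"], ["photoprism", "firewall"]):
    chosen = [name for name, tags in tasks.items() if is_selected(tags, requested)]
    print(requested, "->", chosen)
```

Note that `--tags photoprism,firewall` also selects the `base` ruleset task: the result is the union of both axes, not "photoprism's firewall only".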
### Terraform / Proxmox VM tags (metadata only)
Every Terraform-managed VM carries exactly three Proxmox tags:
| Tag | Value | Purpose |
|-----|-------|---------|
| env | `staging` \| `production` | which environment |
| role/group | `docker_hosts`, `proxmox_hosts`, … | matches the inventory group |
| managed-by | `terraform` | distinguishes IaC VMs from hand-made ones |
These are **pure metadata for transparency** (glanceable in the Proxmox UI). They do
**not** drive run-targeting and do **not** feed inventory — `scripts/tf_to_inventory.py`
keeps building groups from the `group` output field, the single source of truth.
## Enforcement
`tests/tags.yml` is the single source of truth for the allowed concern/special/
opt-in/playbook tags. `scripts/check-tags.py` (run by `make lint`, covered by
`tests/test_check_tags.py`) scans `roles/` and `playbooks/` and fails on any tag
outside `{role directory names} ∪ {tests/tags.yml entries}`.
Molecule scenario files (`roles/*/molecule/**`) are excluded from the scan — they are test orchestration, not the production run-targeting surface this standard governs.
It also checks that every role imported in a play's `roles:` block carries its own role name as a tag (additional tags are allowed).
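The two checks described above reduce to a set difference and a membership test. The following is a minimal sketch of that logic under assumed inputs (plain lists rather than parsed YAML); it is not the contents of `scripts/check-tags.py`, and the function names are hypothetical.

```python
def find_unknown_tags(role_names, approved_tags, used_tags):
    """Return tags that are neither a role name nor an approved tag, sorted."""
    allowed = set(role_names) | set(approved_tags)
    return sorted(set(used_tags) - allowed)


def roles_missing_own_tag(role_imports):
    """Return role names whose import does not carry the role name as a tag.

    `role_imports` mirrors a play's `roles:` block: a list of dicts with
    `role` and optional `tags` keys (extra tags are allowed).
    """
    return [
        item["role"]
        for item in role_imports
        if item["role"] not in item.get("tags", [])
    ]


roles = ["base", "photoprism"]
approved = ["packages", "firewall", "config", "deploy", "always", "never"]
used = ["photoprism", "config", "fierwall"]  # deliberate typo

print(find_unknown_tags(roles, approved, used))  # ['fierwall']
print(roles_missing_own_tag([
    {"role": "photoprism", "tags": ["photoprism"]},
    {"role": "base", "tags": ["hardening"]},
]))  # ['base']
```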
## Extending the vocabulary
To add a concern tag: (1) add it to `tests/tags.yml`; (2) add a row to the concern
table above with a one-line justification showing it passes the litmus test
(cross-cutting, 2+ roles, distinct). That is the whole gate — lightweight, but it
leaves a paper trail.
## Consequences
- Targeted runs are predictable: only two kinds of tags exist, one of them mechanical.
- Over-tagging is structurally resisted (closed list + lint enforcement).
- Intersection targeting is unavailable by design.
- Authors must keep role tags = role names. `make lint` enforces both the *vocabulary* (every tag is a known role name or approved tag) and that each role import in a `roles:` block carries its own role-name tag (extra tags allowed).
## Related
ADR-002 (security baseline / firewall), ADR-004 (one service = one role),
ADR-009 (TF↔Ansible handoff / inventory), ADR-018 (logging).

@ -0,0 +1,150 @@
# ADR-020 — Firewall strategy: two-layer model with a shared service catalog
## Status
Accepted (2026-06-06). Resolves TODO 3.5 ("Decide the firewall strategy — which
firewall, ruleset, per-host vs central").
**Strategy ADR.** It pins the architecture and each layer's responsibilities; the
detailed builds are separate follow-up efforts (see *Scope*).
## Context
boma needs a firewall strategy that is predictable, declarative, and defends the stated
threat model — opportunistic external, lateral movement / blast radius, operator/agent
error (ADR-002). The pieces were already committed across other ADRs (`nftables`
default-deny on hosts — ADR-002; OPNsense at the perimeter — ADR-007; Docker with
`iptables: false` — ADR-004), but nothing tied them together: which layer owns what,
where firewall intent is declared, and how the layers stay consistent. Without that,
ports drift open ad-hoc and "per-host vs central" stays unanswered.
## Decision
### Two layers, distinct jobs
**OPNsense — perimeter + inter-VLAN.** Owns the WAN edge and all policy *between zones*:
`lan`/`iot`/`guest` → `srv`, `mgmt` access, and the per-VLAN egress rules (ADR-007). It
is **structurally blind to intra-`srv` traffic** — services share the switched `srv`
subnet (VLAN 20), which never reaches the gateway.
**Host nftables — host-local + east-west within `srv`** (in the `base` role, every VM):
- **Default-deny inbound**; allow loopback + established/related.
- **East-west allowlist**: a service host accepts a connection only from declared
sources (e.g. the reverse proxy, a named peer) — the lateral-movement control OPNsense
cannot provide.
- **Permissive egress**: allow outbound + established/related; per-VLAN egress
restriction stays at OPNsense (ADR-007). Host-level egress allowlisting is
high-friction (every DNS/NTP/update/registry/webhook must be enumerated) for limited
added benefit once the VLAN already bounds where a host can go.
- **Docker**: daemon runs with `"iptables": false`; nftables owns all filtering,
including container traffic (ADR-004).
- **Guaranteed management plane**: loopback, established/related, `wt0` (NetBird,
ADR-016), and SSH from the control node's LAN address (`base__firewall_control_addr`,
the `ssh-from-control` source) for SSH + Ansible are always allowed, independent of the
catalog, applied atomically — a malformed or empty catalog can never lock out
management. The control-node source is part of the guaranteed plane, not the service
catalog (it is management, not a service); see ADR-021 for the access doctrine.
So "per-host vs central" is answered: **both**, with clear ownership.
### Single source of truth — a shared service catalog
A central, declarative **service catalog** in `group_vars/` is the one source of truth
for firewall intent (aligning with ADR-002's "port definitions live in `group_vars/`",
and keeping connectivity *topology* in inventory rather than in any one self-contained
service role — ADR-004). Each entry describes a service's **ingress**:
```yaml
photoprism:
  ingress:
    - { from: reverse_proxy, port: 2342, proto: tcp }
reverse_proxy:
  ingress:
    - { from: lan, port: 443, proto: tcp }
```
`from` is **symbolic**, resolved at render time: a host/group → IP(s) from inventory; a
role (`reverse_proxy`) → the host(s) filling it; a VLAN/zone (`lan`) → the subnet from
the ADR-007 table. This keeps the catalog readable and resilient to IP changes.
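A minimal sketch of how symbolic resolution might work at render time. The lookup tables, host names, and the precedence order (zone, then role/group, then host) are assumptions for illustration; in boma the data would come from inventory and the ADR-007 table. Addresses use documentation ranges.

```python
# Hypothetical lookup tables (documentation-range addresses, RFC 5737).
HOSTS = {"proxy01": "192.0.2.10", "photos01": "192.0.2.20"}
GROUPS = {"docker_hosts": ["proxy01", "photos01"]}
ROLES = {"reverse_proxy": ["proxy01"]}
ZONES = {"lan": "198.51.100.0/24"}


def resolve_source(name):
    """Resolve a symbolic `from` value to a list of IPs or subnets."""
    if name in ZONES:
        return [ZONES[name]]
    for table in (ROLES, GROUPS):
        if name in table:
            return [HOSTS[host] for host in table[name]]
    if name in HOSTS:
        return [HOSTS[name]]
    # Fail loudly rather than render a rule for an unknown source.
    raise KeyError(f"unknown firewall source: {name!r}")


print(resolve_source("reverse_proxy"))
print(resolve_source("lan"))
```

Raising on an unknown name is one possible design choice: a typo in the catalog then stops the render instead of silently producing a missing or over-broad rule.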
### Each layer renders only its own slice
| Ingress rule | Host nftables | OPNsense |
|---|---|---|
| `from: reverse_proxy` (a `srv` peer) | allow proxy IP → port | — (intra-`srv`, invisible) |
| `from: lan` (cross-VLAN) | allow `lan` subnet → port | allow `lan` → host:port |
The dominant pattern falls out naturally: most services are **proxied** — their only
ingress is `from: reverse_proxy`, and users reach them through the reverse proxy, which
alone carries `from: lan, port: 443` (matches "services sit behind the reverse proxy
with authentication", ADR-002).
This was chosen over a single connectivity-model-generates-both (too much machinery,
tight coupling of two very different rule domains) and over fully independent per-layer
declarations (real drift risk).
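The slicing rule in the table above can be stated as a small function: the host layer always enforces ingress, and OPNsense renders a rule only when traffic crosses zones. This is a sketch under the assumption that each source and target has already been mapped to a zone name; `layers_for` is a hypothetical helper.

```python
def layers_for(source_zone, target_zone):
    """Return which layers must render an ingress rule.

    Host nftables always enforces ingress on the target host. OPNsense only
    sees, and so only renders, traffic that crosses zones.
    """
    layers = ["host-nftables"]
    if source_zone != target_zone:
        layers.append("opnsense")
    return layers


# `from: reverse_proxy` to a service, both in `srv`: host layer only.
print(layers_for("srv", "srv"))
# `from: lan` to the reverse proxy in `srv`: both layers.
print(layers_for("lan", "srv"))
```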
### Off-cluster hosts — `askari` (Hetzner)
`askari` sits outside the Proxmox cluster and has no OPNsense. Its **perimeter** layer
is a TF-managed **Hetzner Cloud Firewall** (declared in `terraform/environments/offsite/`)
alongside the VM itself. Rule set: SSH inbound from `ubongo`'s public IP (M2), plus
TCP 80/443 + UDP 3478 opened in **M4a** (Caddy + NetBird). The `netbird_coordinator`
service role that uses 3478 lands in **M4b**; the ports are already open.
The `group_vars` service catalog remains authoritative for `askari`'s **host nftables**
layer — the same two-layer model applies, with Hetzner Cloud Firewall substituting for
OPNsense at the perimeter.
---
### OPNsense automation — owned here, mechanism deferred
OPNsense is Ansible-managed (CLAUDE.md: "OPNsense is entirely Ansible; no Terraform
OPNsense provider"). It renders the cross-VLAN slice of the catalog plus the static
ADR-007 facts. The **how** — config-XML templating vs the OPNsense API vs a plugin — is
deferred to the OPNsense-as-code follow-up spec. Recorded as an explicit open
sub-decision.
## Guardrails
- **The catalog is authoritative.** If a port is not in the catalog, it does not exist —
hardening the existing rule "never open a firewall port ad-hoc on a host" (ADR-002).
- **The `firewall` tag** (ADR-019) marks firewall tasks; `--tags firewall` re-renders
rules.
- **Drift detection (aspiration).** A deterministic check — in the spirit of
`scripts/check-tags.py` — comparing each host's live `nft` ruleset / listening ports
against the catalog and flagging anything undeclared. Ties to TODO 8.5
(`/security-review`). Not necessarily built first.
## Consequences
- Lateral movement within `srv` is constrained — the gap OPNsense structurally can't
close.
- One declarative catalog → no ad-hoc ports and no cross-layer drift on shared facts
(ports, IPs, sources).
- Cost: the catalog + render-per-layer machinery must be built and maintained; east-west
allowlisting adds per-service ingress declarations (mitigated by proxied-by-default,
which keeps most entries to a single line).
## Scope
**Decided here:** the two-layer model and responsibilities; host nftables = default-deny
inbound + east-west allowlist + permissive egress + guaranteed management plane + Docker
`iptables:false`; the shared `group_vars` catalog as single source of truth with
symbolic sources; each layer renders its own slice; the no-ad-hoc-ports guardrail.
**Deferred to follow-up specs (each its own brainstorm → plan):**
1. **Host nftables implementation** in `base` — catalog schema, nftables template,
Docker `iptables:false` integration, fail-safe ordering, Molecule tests. The natural
next spec.
2. **OPNsense-as-code** — tooling mechanism + cross-VLAN rule rendering.
3. **Drift-detection check** — if/when built.
## Related
ADR-002 (security baseline: nftables default-deny, fail2ban, blast radius),
ADR-004 (Docker model: `iptables:false`), ADR-007 (network topology, VLANs, OPNsense,
per-VLAN egress), ADR-016 (NetBird mesh: SSH on `wt0` only), ADR-019 (`firewall` tag),
ADR-021 (operational access doctrine; `ssh-from-control` management-plane source).

@ -0,0 +1,238 @@
# ADR-021 — Operational access: documented, verifiable ways in
## Status
Accepted (2026-06-09). Resolves TODO 7.2 (what to set up on hosts given direct access
will be rare) and TODO 3.2 (the service admin-API access question). **Amended
2026-06-18:** the on-`ubongo` sudo model for the two local accounts is now settled
(see §Sudo model on `ubongo` below).
**Doctrine ADR.** It pins the operational-access doctrine, the declarative `access__*`
data model, the rendered `ACCESS.md` record, and the `/check-access` verifier. It does
**not** build any of them — `base`'s non-firewall concerns, service roles, and live
hosts do not exist yet. Designed now, built when there is something to access (see
*Scope*). Reconciles a latent contradiction between ADR-016 and ADR-020 (see
*Reconciliation*).
## Context
boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
`wt0` mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
ports (ADR-002/ADR-020). That posture is correct — but it leaves one operational
question unanswered: **when a host or service breaks, how does the operator (and the AI
working from `ubongo`) actually get in to troubleshoot it?**
Troubleshooting is far more effective with *several* documented ways in — SSH, container
exec, logs, an admin API — so a single broken path does not mean blind. Today boma has no
standard guaranteeing those paths exist, are documented, or still work. The risk is the
classic one: the access you assumed you had is stale exactly when you need it (key
rotated, API disabled, token expired).
boma already has the right *shape*. Service roles carry record docs — `SECURITY.md`
(security answers) and `VERIFY.md` (acceptance spec). What is missing is the third
sibling — an operational-access record — and the doctrine behind it.
Two constraints shape the decision:
1. **Minimal attack surface is non-negotiable.** "Multiple ways in" must mean multiple
paths over *trusted* interfaces, never new exposed ports.
2. **A documented path that is never tested drifts** — it fails exactly when needed. So
the access facts must be *data* that both renders the doc and drives an active
verifier; the two can then never disagree.
## Decision
### The doctrine
> **Every host and every service guarantees at least one documented, verifiable way in
> for operational troubleshooting — and the deploy that creates it also records and
> proves it.**
Access is a deployment deliverable, not something rediscovered under pressure. The deploy
that creates a host/service also records its access paths and (by design) proves them.
### Two layers
- **Host layer** (resolves TODO 7.2). Every host, via the `base` role, guarantees a fixed
access baseline: SSH over `wt0` and from `ubongo` (the ladder below), Docker/Compose
tooling present, and log shipping live (Alloy → Loki; ADR-018). Little is *exposed*; a
known, uniform set of paths exists over trusted interfaces. The break-glass console per
host class is recorded once at this layer. This is boma's answer to "what every host
runs for access."
- **Service layer** (resolves TODO 3.2). Every service role guarantees and records its
own paths: container exec + compose management, its Loki log labels, and its admin API
where one exists (enabled, token in vault, endpoint + health probe documented) — or an
explicit "no API."
### The three-tier access ladder
1. **`wt0` mesh SSH — primary.** WireGuard *cryptographically authenticates* the peer
before SSH sees it. The preferred path (ADR-016's original rationale).
2. **LAN SSH from `ubongo` only — secondary, mesh-independent.** All hardware but
`askari` shares a LAN. SSH from `ubongo`'s LAN address is allowed, giving a fallback
that survives a NetBird/`wt0` outage. It is gated by *source IP* (spoofable on a LAN)
**plus** the standing keys-only + fail2ban SSH hardening (ADR-002), so the marginal
cost is "SSH daemon reachable from one trusted LAN host" — modest and deliberate. All
*other* LAN hosts stay default-denied.
3. **Console — break-glass.** Mesh-*and*-LAN-independent, recorded per host class, never
exercised for routine work:
- **Cluster VMs** → Proxmox serial/VNC console — independent of the guest network,
`wt0`, and even a broken guest nftables ruleset.
- **`askari`** (bare-metal Hetzner) → provider rescue/console.
- **`ubongo`** (physical) → local console.
A total mesh outage therefore still leaves at least one documented way in to each box.
### Reconciliation, not weakening
ADR-016 already requires Ansible to reach the fleet by LAN IP — "a mesh/coordinator
outage never blocks on-LAN runs" — which **requires** LAN SSH from `ubongo`. Yet ADR-016
also stated "SSH only on `wt0`," and ADR-020's guaranteed management plane listed only
`wt0`. That was a latent contradiction. ADR-021 resolves it by making the control-node
SSH allow **explicit** and adding it to the guaranteed management plane. This does **not**
weaken default-deny: it admits exactly one extra trusted source on the LAN (`ubongo`),
keys-only + fail2ban-gated; every other LAN host stays denied. ADR-016 and ADR-020 are
amended to cross-reference this ladder.
### The declarative `access__*` data model
Structured access facts live as **data** — the single source of truth that both renders
`ACCESS.md` *and* tells `/check-access` what to probe, so doc and verifier cannot diverge
(the firewall-catalog philosophy of ADR-020, applied to access).
Each service role's defaults carry:
```yaml
access__service: photoprism
access__compose_project: photoprism # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db] # exec targets
access__log:
  loki_labels: { service: photoprism } # how to query logs (ADR-018)
access__api:
  enabled: true
  base_url: "http://photoprism.srv:2342" # reachable over the mesh
  firewall_ref: photoprism-api # the catalog entry that opens it (ADR-020)
  auth: { vault_ref: "vault.photoprism.api_token" }
  health_path: "/api/v1/status" # what /check-access pings
# where the service has no API:
# access__api: { enabled: false, reason: "<none upstream>" }
```
**Invariant — `access__api` never opens a port.** It `firewall_ref`s an entry in the
`group_vars` firewall catalog; ADR-020 stays the **sole owner of exposure**. The access
data adds only *how to use* the path (endpoint, token ref, health probe) — no duplication,
no ad-hoc ports (CLAUDE.md: ports only in the catalog).
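The invariant lends itself to a mechanical check. Below is a minimal sketch, assuming the `access__api` block is available as a plain dict and the catalog as a set of entry names; the function name and the exact list of required keys are assumptions, not an existing boma lint.

```python
def validate_access_api(api, catalog_entries):
    """Return a list of problems with an `access__api` declaration."""
    problems = []
    if not api.get("enabled", False):
        # "No API" must be explicit and carry a reason.
        if not api.get("reason"):
            problems.append("disabled API must state a reason")
        return problems
    ref = api.get("firewall_ref")
    if ref not in catalog_entries:
        # The access data may only point at exposure the catalog already owns.
        problems.append(f"firewall_ref {ref!r} is not in the firewall catalog")
    for key in ("base_url", "health_path"):
        if not api.get(key):
            problems.append(f"missing {key}")
    if not api.get("auth", {}).get("vault_ref"):
        problems.append("missing auth.vault_ref")
    return problems


catalog = {"photoprism-api"}  # hypothetical catalog entry names
api = {
    "enabled": True,
    "base_url": "http://photoprism.srv:2342",
    "firewall_ref": "photoprism-api",
    "auth": {"vault_ref": "vault.photoprism.api_token"},
    "health_path": "/api/v1/status",
}
print(validate_access_api(api, catalog))
```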
The host baseline (SSH on `wt0` + from `ubongo`, Docker/Compose present, Alloy live) is
uniform, so it is asserted by `base` and recorded once at the host/group level, not
re-stated per service.
### The rendered record — `ACCESS.md`
`ACCESS.md` is a first-class sibling of `SECURITY.md`/`VERIFY.md`, **rendered** from the
`access__*` data with a prose tail for the narrative parts:
- **Access paths (generated)** — a table: each path (mesh SSH, LAN-SSH-from-`ubongo`,
exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact
invocation.
- **Break-glass (generated from host class)** — the Proxmox/provider/local console line.
- **Operational notes (prose)** — service quirks, gotchas, "if X is wedged, do Y." The
part a template cannot know.
A `docs/access/service-access-template.md` defines the shape, alongside the existing
security/verify templates.
### The verifier — `/check-access`
`/check-access <service|host>` runs from `ubongo` and turns the `access__*` data into
live probes, reporting which declared paths are green right now — the access analogue of
`/verify-service` (ADR-017). It probes mesh SSH, LAN SSH, exec + compose, Loki logs, and
the admin API health path; on any red it names the path and the likely cause. **Break-glass
is checked for reachability only, never exercised** — firing a serial console is invasive,
so the verifier confirms the fallback *exists* without disrupting anything. Designed now,
**build-pending on infra** (needs live hosts + staging + vault), exactly like
`/verify-service` under ADR-017.
### Governance
Three light touches, mirroring how `SECURITY.md`/`VERIFY.md` are enforced: the service
checklist (`docs/security/service-checklist.md`) gains an access item; the `new-role`
runbook gains a fill/render/`check-access` step (step 11: copy
`docs/access/service-access-template.md` into `roles/<service>/ACCESS.md` and populate the
`access__*` data); and a service-checklist gate item blocks clearance until the record
exists and `/check-access` is green (or a deviation is recorded in `accepted-risks.md`).
No scaffold change — same manual-copy-plus-review pattern the sibling records
(`SECURITY.md`/`VERIFY.md`) use.
### Sudo model on `ubongo` (amendment 2026-06-18)
The original ADR left on-`ubongo` local sudo unspecified. The integration-testing
harness shakedown settled it:
| Account | Role | Sudo |
|---|---|---|
| `claude` | Automated AI-worker | `NOPASSWD:ALL` via repo-managed drop-in (`base__ai_worker_user`) |
| `sjat` | Human operator | Password-required sudo via the `sudo` group |
**Rationale for `claude NOPASSWD`.** No-sudo blocked the AI-worker from diagnosing a
failed test VM: `virsh`, `virt-install`, `cloud-localds`, `nft`, `journalctl` (almost
every low-level diagnostic tool) all require root. The harness's core value is
autonomous spin-up → apply → reboot → assert → diagnose; that loop collapses without
local root access.
**Compensating controls (R7 in `docs/security/accepted-risks.md`):**
- `claude`'s password is locked — `NOPASSWD` is the account's *only* sudo path; no
interactive login is possible.
- `auditd` + Loki attribution (ADR-018) separates human from agent root actions in the
audit trail.
- The drop-in is repo-managed and revocable in one commit + one deploy.
- Single-operator homelab; everything in git; off-machine backups (ADR-022).
**`sjat` NOPASSWD removed.** The operator's former `NOPASSWD` drop-in
(`/etc/sudoers.d/sjat-ansible`, added as an interim measure during M5 NetBird
enrolment) was removed 2026-06-18. It was redundant once `claude` held sudo, and its
removal restores least-privilege for the human operator. `sjat` retains full sudo
capability via the `sudo` group (password required).
## Consequences
- Every host and service has at least one documented, verifiable way in — and a verifier
that proves it, so stale access is caught before an outage, not during one.
- Doc and verifier share one source of truth (`access__*`), so they cannot drift apart.
- The management plane gains exactly one extra trusted LAN source (`ubongo`); attack
surface grows by one keys-only + fail2ban-gated SSH path, no new exposed ports.
- Cost: per-service `access__*` declarations and a rendered `ACCESS.md` to maintain
(mitigated by the uniform host baseline + the new-role runbook step + checklist gate), plus `/check-access` to build.
## Scope
Delivered by ADR-021's implementation plan
(`docs/superpowers/plans/2026-06-09-operational-access.md`), task by task, and tracked in
`STATUS.md` as it lands — not all of it exists at the moment this ADR is written. The split
below is near-term tranche vs longer build-pending, not instant-existence vs not.
**Near-term tranche (this plan):** the doctrine; this ADR; the `ACCESS.md` template; the
`ssh-from-control` firewall management-plane source — added to ADR-020's *guaranteed
management plane* (the always-allowed block that already holds the `wt0` SSH/Ansible allow
and is explicitly independent of the service catalog), not added to the catalog itself (the
catalog owns service ingress only) — via the `base__firewall_control_addr` knob and its
nftables rule, both of which do **not** exist in `roles/base` yet and land with the
`firewall` concern of `base`; and the governance wiring (checklist item, new-role runbook step). ADR-016 and ADR-020 are amended to reference the ladder.
**Build-pending on infra:** per-service `access__*` data and rendered `ACCESS.md` files
(wait on service roles), `/check-access` *running* (waits on live hosts + staging + vault),
and the real `ubongo` LAN address value behind `base__firewall_control_addr`. Designed now,
built when there is something to verify.
**Out of scope:** broader LAN SSH (a management VLAN) — explicitly rejected, `ubongo`-only;
exercising (vs reachability-probing) the break-glass console; any access path that is not
over the mesh or the one `ubongo` LAN source.
## Related
ADR-002 (security baseline: SSH hardening, default-deny, fail2ban), ADR-004 (Docker
model, Compose), ADR-016 (NetBird mesh; amended — SSH on `wt0` **and** from `ubongo`'s
LAN address), ADR-017 (`/verify-service` Level-4 verification), ADR-018 (logging:
Alloy → Loki/Grafana), ADR-020 (firewall: service catalog + guaranteed management plane;
amended — adds the `ssh-from-control` management-plane source), ADR-019 (`firewall` tag).

@ -0,0 +1,277 @@
# ADR-022 — Backup & disaster recovery: data-only restic, off-cluster pull node, 3-2-1
## Status
Accepted (2026-06-10). Resolves TODO 3.8 ("ensure the right things are backed up,
incl. DB dumps") and `CAPABILITIES.md` §9 (backup engine / off-site / air-gap, all
"planned"). Grounds ADR-011's "backup-first" and "snapshot + backup" language, which
assumed a backup policy existed but never defined one.
**Doctrine ADR.** It pins the recovery model, backup engine, topology, per-service
contract, encryption/escrow, restore-testing tiers, retention, alerting, and USB
air-gap mechanism. It does **not** build any of them — the `backup` role, `fisi`
node, per-service `backup__*` declarations, and `BACKUP.md` files do not exist yet.
Designed now, built in the implementation plan referenced at the foot of this ADR.
## Context
boma has no defined backup policy. The ADRs assume one exists — ADR-011 makes
"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
path — but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
copies live, *how* they are encrypted, or *whether restores actually work*.
`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
but commits to nothing.
The gap is not just theoretical. Every boma service is stateful in some dimension:
DB contents, bind-mount data dirs, the Vaultwarden vault that holds every secret in
the stack. Without a backup policy the IaC is not reproducible from nothing; it is
reproducible-modulo-data. This ADR closes that gap.
## Decision
### 1. Recovery model — data-only backups, rebuild from code (Model A)
boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
Ansible re-renders the Docker Compose stack. Backups therefore protect **state only**
(DB contents, bind-mount data dirs, Vaultwarden's vault), not whole-VM images.
Recovery sequence: Terraform re-provisions the VM → Ansible redeploys → restic
restores the data. **No Proxmox Backup Server (PBS) in v1.** This keeps the 3-2-1
topology cheap, fits pCloud's 1 TB comfortably, and turns every restore drill into
a continuous proof that the IaC *and* the backups both work.
Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run
plus data restore, potentially hours), and it bets the repo is complete enough to
rebuild from nothing — which Tier-2 restore testing (Decision 8) exists to verify.
**PBS (Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO
proves too slow; nothing here precludes it.
### 2. One backup tier, ~24 h RPO
A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
the board. No per-data-type tiering yet — revisit once there is real-world data and
experience to justify the added machinery.
### 3. Engine — restic (data) + rclone (off-site); no second encryption layer
- **restic** captures state into an encrypted, deduplicated repository.
- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
rclone has a first-class pCloud backend).
- restic encrypts the repo at rest, so rclone copies **ciphertext only** — no second
encryption layer, no pCloud "crypto folder."
No PBS in v1 (see Decision 1).
### 4. Topology — central pull node (`fisi`), off the cluster; `backup_hosts` group
A single backup node owns the canonical restic repo. It is **off the Proxmox cluster**
— an independent failure domain, so copy 2 survives a PVE node (or the whole cluster)
dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
(off-site): a manually-provisioned physical node in its own inventory group, still
Ansible-managed (the `base` role applies, plus a `backup` role).
**Pull model.** `fisi` holds SSH keys to each host; per service it runs the declared
dump command remotely, pulls the declared paths read-only, then `restic` snapshots the
staged data into its local repo. **Hosts hold no backup credentials and cannot reach
the repo** — a compromised or ransomwared service host cannot delete backup history.
**Node assignment:** `fisi` (an HP Elite 600 G9 tower) is penciled in / provisional —
the *role* ("the backup node") is load-bearing; the physical assignment may be
revisited when all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
(ZFS or mdraid → 8 TB usable, survives one disk failure). It owns the repo, runs the
pull orchestration, runs `rclone → pCloud`, and docks the USB air-gap drives
(Decision 11).
**Inventory:** a new `backup_hosts` group is added to both inventories, structured
like `control` and `offsite_hosts`. The `base` role applies.
### 5. 3-2-1 mapping
| Copy | Location | Medium | Off-site? | Notes |
|---|---|---|---|---|
| 1 | Live data on each host | NVMe/SSD | no | The working data |
| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Consequences) |
| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |
≥3 copies, ≥2 media, ≥1 off-site — 3-2-1 satisfied, with the air-gap drive as a
fourth, offline copy that no online compromise can reach.
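
The 3-2-1 claim can be checked mechanically against the table. A minimal Python sketch, assuming the four rows above are transcribed by hand into a list; `COPIES` and `satisfies_3_2_1` are illustrative names, not part of the repo:

```python
# Hand-transcribed from the 3-2-1 mapping table (illustrative data structure).
COPIES = [
    {"location": "live data on each host", "medium": "NVMe/SSD", "offsite": False},
    {"location": "fisi restic repo", "medium": "HDD mirror", "offsite": False},
    {"location": "pCloud", "medium": "cloud", "offsite": True},
    {"location": "USB air-gap drive", "medium": "removable HDD", "offsite": True},
]


def satisfies_3_2_1(copies) -> bool:
    """True when there are >=3 copies on >=2 distinct media with >=1 off-site."""
    media = {copy["medium"] for copy in copies}
    offsite = sum(1 for copy in copies if copy["offsite"])
    return len(copies) >= 3 and len(media) >= 2 and offsite >= 1
```

Under these assumptions the first three rows already satisfy the rule, which is why the USB drive is counted as an extra copy rather than a required one.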
### 6. Per-service backup contract — `backup__*` data + `BACKUP.md`; governance
Each service role declares its backup needs in role vars — the same render-from-data
pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
```yaml
backup__service: nextcloud # identifier; matches the role / compose project
backup__state: true # false = stateless → no BACKUP.md (pair with a reason)
backup__paths: # bind-mount dirs / files holding state ([] = none)
  - /srv/nextcloud/data
backup__dumps: # logical app-consistent dumps ([] = none)
  - cmd: "docker compose -p nextcloud exec -T db pg_dump -U {{ vault.nextcloud.db_user }} nextcloud"
    dest: nextcloud-db.sql
backup__quiesce: false # true = stop→back up→restart escape hatch (Decision 7 B)
```
The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
service with **no** `backup__paths` must explicitly declare `backup__state: false` with
a reason; omission is never an implicit "nothing to back up." (`backup__state` and the
list-form `backup__dumps` are this ADR's resolution of the spec's open "declared, not
silent" point.)
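
The "declared, not silent" rule lends itself to a small check. A minimal Python sketch, assuming the role vars are already loaded into a dict; `backup__state_reason` is a hypothetical variable name (the ADR requires a reason but does not name the variable), and `validate_contract` is illustrative, not the `/check-backup` verifier. It reads "nothing to back up" as neither paths nor dumps being declared, which is one interpretation of the rule above:

```python
def validate_contract(role_vars: dict) -> list[str]:
    """Return problems with a role's backup__* declaration (empty list = OK)."""
    if "backup__state" not in role_vars:
        # Omission is never an implicit "nothing to back up".
        return ["backup__state is not declared"]

    problems = []
    paths = role_vars.get("backup__paths", [])
    dumps = role_vars.get("backup__dumps", [])

    if role_vars["backup__state"]:
        if not paths and not dumps:
            problems.append("stateful service declares neither paths nor dumps")
        for dump in dumps:
            # Each dump entry is a mapping with a command and a destination file.
            if not dump.get("cmd") or not dump.get("dest"):
                problems.append(f"dump entry missing cmd or dest: {dump!r}")
    elif not role_vars.get("backup__state_reason"):
        # Hypothetical variable name; see the note above.
        problems.append("backup__state: false needs a reason")
    return problems
```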
**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md`,
`VERIFY.md`, and `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
what state exists, what is backed up, the dump command, and the per-service restore
procedure. A template lives at `docs/backup/service-backup-template.md`. A **stateless**
service declares `backup__state: false` (with a reason) in its role vars and gets **no**
`BACKUP.md`.
**Governance — runbook + gate, not scaffold (consistent with ADR-021).** Three light
touches mirror how `SECURITY.md`, `VERIFY.md`, and `ACCESS.md` are enforced: the
service checklist (`docs/security/service-checklist.md`) gains a backup item; the
`new-role` runbook gains a fill/render/`check-backup` step (copy
`docs/backup/service-backup-template.md` into `roles/<service>/BACKUP.md` and
populate the `backup__*` data); and a checklist gate blocks service clearance until
the record exists and a restore drill confirms it (or a deviation is recorded in
`accepted-risks.md`). The dormant `/check-backup` verifier is the automated
analogue of `/check-access` (ADR-021). **No automated lint script gates `BACKUP.md`
presence** — same manual-copy-plus-review pattern the sibling records use. The design
document's "make lint gates its presence" wording is superseded by this governance
choice.
### 7. Consistency — logical dumps first; quiesce as escape hatch
- **Default:** databases are captured with logical dumps (`pg_dump` / `mysqldump`) —
portable, version-independent, restorable to a fresh DB. Plain data dirs are backed
up as files. No downtime required.
- **Escape hatch:** a service whose data cannot be dumped live declares a quiesce step
(stop container → back up volume → restart) via `backup__quiesce` in the same contract.
- ZFS/filesystem snapshots are **not** used as the sole DB method (only
crash-consistent for a live database).
This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
way, each service declares how to dump its own data.
### 8. Restore testing — two tiers; `ubongo` stays bare Debian
- **Tier 1 — weekly, automated, rolling restore-verify.** Pick the next service in
rotation, restore its latest snapshot into a throwaway container on `ubongo`
(reusing the Molecule harness, ADR-015), start the app against the restored data,
and run that service's `VERIFY.md` checks (ADR-008/017). This catches the failure
that actually kills people — *silently corrupt or unrestorable backups*. Failures
alert via ntfy.
- **Tier 2 — semi-annual full DR rehearsal,** driven from `ubongo` onto PVE staging.
Rebuild a host from zero via Terraform + Ansible + restic restore on the staging
cluster. This validates the whole Model-A recovery chain. **At least once a year the
rehearsal exercises the paper-secret break-glass path** (Decision 10) end-to-end.
**`ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged).** Its role is to
be the independent recovery anchor — "the tool used to rebuild the cluster must not
live inside the thing it rebuilds." Higher-fidelity real-VM testing is better served
by the PVE staging environment (same hardware class, same cluster, same provisioning
path). `ubongo`'s 1 TB NVMe gives ample room for Tier-1 dataset restores; disk
headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).
### 9. Retention — GFS via restic
Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
Tune once real repo growth is observed.
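
A minimal Python sketch of how the starting policy maps onto a `restic forget` invocation. The repo path in the usage note is a placeholder and `forget_argv` is an illustrative helper, not the backup role's code; only the `--keep-*` values and `--prune` come from this ADR:

```python
# Starting GFS policy from this decision: period -> number of snapshots kept.
RETENTION = {"daily": 7, "weekly": 4, "monthly": 6, "yearly": 1}


def forget_argv(repo: str, retention: dict = RETENTION) -> list[str]:
    """Build the argv for `restic forget --prune` with the given keep rules."""
    argv = ["restic", "--repo", repo, "forget", "--prune"]
    for period, count in retention.items():
        argv += [f"--keep-{period}", str(count)]
    return argv
```

Called as `forget_argv("/srv/restic/repo")` (hypothetical path), the result can be handed to `subprocess.run`. Because the keep rules can select the same snapshot more than once, the policy retains at most 7 + 4 + 6 + 1 = 18 snapshots per snapshot group.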
### 10. Encryption + key escrow + break-glass
restic encrypts the repo at rest, so **one secret — the restic repo password —
protects all copies uniformly** (`fisi`, pCloud, USB). One thing to escrow, not three.
**Escrow locations:**
- **`fisi`, root-only** (plus in the Ansible vault) — so backups run non-interactively
and `fisi` is redeployable.
- **Vaultwarden** — the day-to-day human-accessible copy.
- **Paper, in a physical safe (off-site)** — the break-glass root of trust; the only
copy that survives "everything is down."
**The paper holds *two* secrets:** (1) the **restic repo password** (to read any
backup at all) and (2) the **Ansible vault master password** (to rebuild hosts from
the repo — normally from Vaultwarden via `rbw`, which is itself down in a from-zero
recovery). With both on paper, the break-glass chain has **no circular dependency**:
paper → restic restores Vaultwarden + repo data → the vault password (from paper)
drives Terraform/Ansible re-provisioning → services return, `rbw` works again.
**`mamba` (laptop) is the break-glass clone** (ADR-015): repo + toolchain + mesh +
`rbw`, with Terraform state synced to it — the rebuild can be driven from `mamba` if
`ubongo` is also gone. The paper sheet doubles as a short break-glass runbook assuming
zero running boma infrastructure: install restic on any machine, point it at pCloud
*or* a USB drive with the password, restore Vaultwarden first, then rebuild with the
vault password.
### 11. USB air-gap — plug-and-go cold copy
A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
systemd unit / script that: mounts the drive, confirms it is an expected drive, runs
**`restic copy` from the local repo → a restic repo on the USB drive** (same
password → ciphertext if lost/stolen), runs `restic check` on the USB copy, unmounts,
and **notifies via ntfy** with the result. Only allowlisted serials trigger anything —
a rogue USB does nothing.
`restic copy` (not rsync) so the USB is itself a valid restic repo, restorable
directly in a break-glass with nothing else alive. Drives are rotated and **stored
off-site** — a second geographic off-site copy independent of pCloud.
### 12. Failure alerting — guard against silent death
Success/failure pings alone miss the worst case (*the job silently stopped running*):
- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
monitor**; no ping in ~25 h → alert.
- **Immediate failure → ntfy** on any job or dump-step error.
- **Weekly `restic check`** for repo integrity → alert on corruption.
- **Tier-1 restore-verify failures → ntfy.**
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
Grafana panel (fits ADR-018's monitoring direction; not required for v1).
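
The dead-man's-switch logic is simple enough to state in code. A minimal Python sketch of the rule "no ping in ~25 h → alert"; it illustrates the condition only and is not how Uptime Kuma implements push monitors. `is_overdue` and `MAX_SILENCE` are illustrative names:

```python
from datetime import datetime, timedelta, timezone

# Nightly run plus a one-hour grace period, per the ~25 h figure above.
MAX_SILENCE = timedelta(hours=25)


def is_overdue(last_success: datetime, now: datetime,
               max_silence: timedelta = MAX_SILENCE) -> bool:
    """True when the last successful run is older than the allowed silence."""
    return now - last_success > max_silence


now = datetime(2026, 6, 21, 3, 0, tzinfo=timezone.utc)
ran_last_night = now - timedelta(hours=24)
missed_a_night = now - timedelta(hours=48)
```

With these sample values, `is_overdue(ran_last_night, now)` is false and `is_overdue(missed_a_night, now)` is true. The point of the design is that a job that never runs produces no failure event at all, so only the absence of a success ping can reveal it.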
### 13. Schedule
- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune` →
`rclone sync` → pCloud. Sequential, off-hours.
- **Tier-1 restore-verify:** weekly, rolling one service per run, on `ubongo`.
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
- **USB air-gap:** manual, approximately monthly, whenever a drive is docked.
## Consequences
- boma now has a defined, end-to-end backup policy that closes the gap ADR-011 left
open; "backup-first" and "snapshot + backup" are no longer assumed.
- Every service role that holds state must declare its backup contract (`backup__*`
vars + `BACKUP.md`); stateless services declare `backup__state: false`. Cost:
per-service declarations and a rendered doc to maintain (mitigated by the new-role
runbook step + checklist gate).
- **pCloud is off-site but sync-coupled:** `rclone sync` propagates deletions (a
prune, or a malicious wipe of `fisi`'s repo, replicates to pCloud). The **USB
air-gap drive is the only truly immutable copy**; pCloud's own file-version history
is enabled as a secondary cushion.
- **`fisi` is the crown-jewel host** — it holds an encrypted copy of all state, so it
receives full `base` hardening and tight access. restic encryption means a stolen
`fisi`, USB drive, or pCloud blob yields ciphertext only.
- **pCloud's 1 TB is the off-site capacity ceiling.** Data-only backups fit for years
at homelab scale; flag for `/capacity-review` if the repo trends toward ~1 TB.
- Recovery time under Model A (full Ansible run + data restore) is potentially hours —
slower than a VM-image restore. PBS/Model B is explicitly deferred, not rejected.
- The paper break-glass must be kept current (restic password + vault password). An
outdated paper sheet is the one failure mode this ADR cannot prevent mechanically —
the semi-annual DR rehearsal is the human control.
Full design rationale and worked examples: `docs/superpowers/specs/2026-06-10-backup-strategy-design.md`.
Build path (roles, topology, tests): `docs/superpowers/plans/2026-06-10-backup-strategy.md`.
## Related
ADR-002 (security baseline: hardening applied to `fisi`), ADR-004 (one service = one
role; per-service doc conventions), ADR-008 (testing methodology; Molecule harness
reused for Tier-1), ADR-011 (update management: backup-first rule now grounded),
ADR-015 (`ubongo` recovery model; `mamba` break-glass clone; bare-Debian invariant),
ADR-017 (`VERIFY.md` checks reused in Tier-1 restore-verify), ADR-018 (logging/Alloy
→ ntfy alerting path), ADR-019 (Proxmox tags; `backup_hosts` group), ADR-021
(render-from-data pattern: `access__*`/`ACCESS.md` → `backup__*`/`BACKUP.md`;
runbook+gate governance model).

@ -0,0 +1,106 @@
# ADR-023 — ADR structure & lifecycle
## Status
Accepted (2026-06-10). Meta/doctrine ADR — pins how ADRs are written; the
`adr-structure` check (`scripts/repo-scan.py`) and `docs/decisions/adr-template.md`
ship with it, and ADRs 001–018 were retroactively restructured to conform. Resolves
the FRICTION signal (2026-05-31) about ADR-writing policy being unsettled.
## Context
boma records architectural decisions as numbered ADRs in `docs/decisions/`, and
CLAUDE.md treats them as load-bearing. Yet no ADR said how an ADR is written. The
newest ADRs (019–022) converged on a clean shape (Status → Context → Decision →
Consequences → Related), but only by imitation. ADRs 001–018 predate it and drifted
widely: most lacked a `## Status` section entirely (016–018 carried only a trailing
build-state note), and many lacked an explicit `## Decision` or `## Consequences`
heading, their decisions spread across ad-hoc topical sections. The result was
structural drift and no uniform way to tell an active decision from a superseded or
deprecated one.
## Decision
### 1. Title & filename
Title line: `# ADR-NNN — <Title>: <optional clarifying subtitle>` (em-dash). Filename:
`NNN-kebab-title.md`, zero-padded 3-digit, monotonic, never reused — a superseded ADR
keeps its number and file. A new ADR is registered as a row in the CLAUDE.md
"Further reading" table.
### 2. Mandatory sections, in this order
- `## Status` — a lifecycle line, usually `Accepted (YYYY-MM-DD)` (see §4), plus an
optional one-line note.
- `## Context` — the forces, the problem, what exists today, why now.
- `## Decision` — what we are doing; numbered sub-decisions for multi-part ADRs.
- `## Consequences` — results, trade-offs explicitly accepted, follow-on work.
### 3. Optional sections (use only where they genuinely apply)
`## Related`, `## Scope`, `## Guardrails` / `## Enforcement`, `## What was ruled out`,
`## Verified facts (ADR-014)`.
### 4. Status lifecycle
Four states. Because boma is single-contributor and trunk-based with no review gate,
most ADRs are **born `Accepted (YYYY-MM-DD)`** — committed-to on writing. A
**`Proposed`** state exists for a genuine draft whose core direction is recorded but
whose specifics are still open for discussion (e.g. ADR-011); it is promoted to
`Accepted` once settled.
- **`Proposed (YYYY-MM-DD)`** — drafted, under discussion, not yet committed-to. May
carry open questions. Promoted to `Accepted (YYYY-MM-DD)` when decided.
- **`Accepted (YYYY-MM-DD)`** — committed-to. The common starting state.
- Replaced → old ADR's Status becomes **`Superseded by ADR-NNN (YYYY-MM-DD)`**; the new
ADR records `Supersedes ADR-MMM` in its Status and `## Related`. The link is
**bidirectional**.
- Retired with no replacement → **`Deprecated (YYYY-MM-DD)`** + a one-line reason.
**No silent rewrites.** An Accepted ADR is not edited to reverse its decision. Typo and
clarity fixes are fine; a material reversal requires a new ADR and a `Superseded by`
marker on the old one.
### 5. Template & enforcement
`docs/decisions/adr-template.md` is the scaffold for new ADRs. The `/review-repo`
command's pre-scan (`scripts/repo-scan.py`) emits an `adr-structure` finding for any
numbered ADR missing a mandatory section or with an unparseable Status line. It checks
**presence and Status, not section order** — order is a convention the template carries,
deliberately not gated, to keep enforcement lightweight (consistent with boma's other
doctrine ADRs adding no CI gate).
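
A minimal Python sketch of what such a presence-and-Status check could look like. It is an illustration under stated assumptions, not the code in `scripts/repo-scan.py`: the function name, the finding strings, and the exact Status pattern are hypothetical, and the pattern accepts trailing text after the date because existing Status lines carry notes.

```python
import re

MANDATORY = ("Status", "Context", "Decision", "Consequences")

# Matches the four lifecycle states from section 4, anchored at line start.
STATUS_RE = re.compile(
    r"^(?:(?:Proposed|Accepted|Deprecated)|Superseded by ADR-\d{3})"
    r" \(\d{4}-\d{2}-\d{2}"
)


def adr_findings(text: str) -> list[str]:
    """Return structure findings for one ADR; order of sections is not checked."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        heading = re.match(r"^## (.+?)\s*$", line)  # level-2 headings only
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    findings = [f"missing section: ## {name}"
                for name in MANDATORY if name not in sections]
    if "Status" in sections:
        body = [line.strip() for line in sections["Status"] if line.strip()]
        if not body or not STATUS_RE.match(body[0]):
            findings.append("unparseable Status line")
    return findings
```

The sketch treats any `## ` line as a heading, including one inside a fenced code block, which a real check would likely need to handle.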
### 6. Retroactive conformance of the back-catalogue
ADRs 001–018 are restructured to satisfy this standard rather than grandfathered. The
restructure is **presentational** — existing headings are relabelled, regrouped, or
demoted under a `## Decision` umbrella; a dated `## Status` is added; a `## Consequences`
section is assembled from implications the ADR already states. **The substance of no
decision is changed.** This keeps the check uniform (no number threshold) and the corpus
a consistent, legible decision history.
## Consequences
- New ADRs have one obvious shape and a scaffold; structural drift stops.
- Every ADR declares its lifecycle state uniformly, and reversals are traceable.
- The whole corpus conforms; the check needs no grandfathering and stays simple.
- One-time restructure churn across ADRs 001–018 (heading reorganization + a Status and
a Consequences section per file; no decision substance changed).
- `/review-repo` grows one deterministic check; no new CI machinery.
- This ADR is the first conformant example and is held to its own check.
## What was ruled out
- **A `make lint` / CI gate for ADR structure** — heavier than the risk warrants;
the `/review-repo` check and the template suffice.
- **Machine-enforcing section order** — brittle for marginal value; left as a
template-demonstrated convention.
- **Grandfathering 001–018 from the check** was rejected in favour of restructuring the
whole corpus to conform, so the standard applies uniformly with no exceptions.
## Related
- ADR-014 — knowledge sourcing (the `Verified facts` optional section).
- ADR-019/020/021/022 — the emergent structure this ADR codifies.
- `docs/decisions/adr-template.md` — the scaffold.
- `scripts/repo-scan.py` — the `adr-structure` enforcement check.

@ -0,0 +1,145 @@
# ADR-024 — Reverse proxy: Caddy (ACME — HTTP-01 public, DNS-01 private)
## Status
Accepted (2026-06-14; DNS-01 path resolved + proven 2026-06-15). Amends the soft
Traefik assumption carried by the roadmap (Phase-2 step 5) and ADR-017 prose; those
are updated to read "Caddy (ADR-024)".
> **Cert method follows exposure.** The cert *challenge* depends on whether a host is
> publicly reachable: **public hosts** (askari) use **HTTP-01** with **vanilla Caddy**
> (simplest, no plugin); **mesh/LAN-only cluster services** (no public A-record) use
> **DNS-01** via Gandi (the M1 capability), since they can't satisfy HTTP-01.
>
> **DNS-01 resolved + proven (2026-06-15) — the M4a deferral is closed.** The original
> failure was diagnosed as **version skew**: the image built at M4a used a pre-Bearer
> `libdns/gandi` that sent Gandi's **deprecated `Apikey` header** (→ 403 on a
> verified-valid token), and the `xcaddy` build ran *on a Hetzner IP* (Google's Go
> module proxy 403s those ranges). Both have clean, boma-aligned fixes: **pin
> caddy-dns/gandi v1.1.0** (→ `libdns/gandi` v1.1.0, which sends the PAT as
> `Authorization: Bearer` to `https://api.gandi.net/v5/livedns`) and **build the image
> on ubongo, not Hetzner**. Verified end-to-end (2026-06-15): the custom image issues a
> real **wildcard** cert (`*.dns01test.wingu.me`) against Let's Encrypt **staging** via
> Gandi DNS-01 using `vault.gandi.pat`; `caddy validate` accepts `acme_dns gandi` on the
> custom image and rejects it on vanilla `caddy:2`. Build with `make caddy-image`; the
> `reverse_proxy` role enables it per-instance via `reverse_proxy__acme_dns_provider:
> gandi` + `reverse_proxy__image`. **Traefik was reconsidered and rejected again:**
> lego's Gandi provider faces the *same* PAT-vs-Apikey question, so switching would not
> have dodged the issue, and would reverse this ADR for nothing. askari (M4a) stays on
> HTTP-01 (a public host needs no DNS-01).
## Context
boma needs a reverse proxy to front its services with TLS. ADR-002 requires every
service to sit behind a proxy with authentication before it is reachable; ADR-007/M1
delivers a `*.<domain>` wildcard cert via ACME DNS-01 against Gandi (the apex `boma`
domain, matching ROADMAP M1) — the only viable cert path for mesh/LAN-only services
that cannot satisfy HTTP-01 (no public A-record to point at).
The roadmap (Phase-2, step 5) and ADR-017 prose assumed **Traefik + Authentik** as the
auth-and-proxy pair without an ADR ever pinning Traefik. On closer inspection:
- Traefik's headline feature is **dynamic Docker-label discovery** — it discovers and
routes services automatically from container labels without any static config.
- boma already renders *all* config from Ansible templates and the `group_vars` catalog
(ADR-004). That makes dynamic label discovery a disadvantage: a service that is not in
the catalog does not exist (CLAUDE.md), so any route that Traefik auto-discovers
outside the catalog would be unaudited.
- The first reverse-proxy instance is needed on `askari` for M4 (NetBird), a host where
`docker_hosts` patterns are being established under off-site/VPS constraints, not a
full Proxmox cluster with many services.
No production investment in Traefik config has been made; the decision can be made
cleanly here.
## Decision
boma's reverse proxy is **Caddy**.
### 1. Rationale for Caddy over Traefik
1. Traefik's dynamic label discovery is wasted — boma renders config from the catalog;
Caddy's static Caddyfile maps naturally to "render from templates" (ADR-004).
2. Caddy's Caddyfile is simple to template with `ansible.builtin.template`; one file,
one `ansible_managed` header, no side-channel label state.
3. **Automatic HTTPS** via ACME DNS-01: the `caddy-dns/gandi` plugin satisfies the
Gandi DNS-01 challenge, which is the only cert path for services with no public
A-record (ADR-007/M1 wildcard strategy).
4. Far simpler for a solo operator: no dashboard-as-a-service, no routing-rule DSL,
no dynamic config files to reconcile.
5. `forward_auth` to Authentik is a first-class Caddy directive — the planned
Authentik auth story (ADR-002) is preserved without Traefik as the middleman.
### 2. Custom image (DNS-01 path — built)
> Applies only to the **DNS-01** path. M4a ships **vanilla `caddy:2`** on askari
> (HTTP-01) — no custom image; only DNS-01 hosts pull the custom one.
Caddy's official Docker image does not include third-party DNS plugins. The
`caddy-dns/gandi` plugin must be compiled in via `xcaddy`. boma builds a custom image
(`.docker/caddy-gandi/Dockerfile`, `make caddy-image`), **pinned** (ADR-011/ADR-014):
```dockerfile
FROM caddy:2.11.4-builder AS build
RUN xcaddy build v2.11.4 --with github.com/caddy-dns/gandi@v1.1.0
FROM caddy:2.11.4
COPY --from=build /usr/bin/caddy /usr/bin/caddy
```
Two hard constraints, both learned from the M4a failure:
1. **Build on ubongo, not Hetzner.** Google's Go module proxy 403s Hetzner IP ranges, so
the on-host build on askari failed. ubongo (the control node) builds it in ~1 min,
then it is pushed to the Forgejo registry (`make caddy-image-push`) and pulled by
DNS-01 hosts — the same artifact pattern as the Molecule image.
2. **Pin a Bearer-capable plugin.** caddy-dns/gandi v1.1.0 → libdns/gandi v1.1.0 sends
the PAT as `Authorization: Bearer`. Older versions used the deprecated `Apikey`
header and 403 on a PAT — that was the M4a "valid token but no TXT record" symptom.
### 3. Deployment scope
The first Caddy instance runs on `askari` (M4a), serving a test vhost over HTTP-01 to
prove the proxy + ACME path. It fronts the NetBird stack in **M4b** (when the
`netbird_coordinator` role is built). The pattern generalises to the Proxmox cluster in
Phase 2 when services multiply.
### 4. Authentik integration (deferred)
`forward_auth` to Authentik is deferred to Phase 2 (when Authentik is deployed on the
cluster). The Caddyfile template will carry a placeholder comment. No Traefik-Authentik
middleware migration is required.
## Consequences
- **Roadmap Phase-2 step 5** is updated from "Authentik + Traefik" to "Authentik +
Caddy (ADR-024)".
- **ADR-017 prose** that mentioned Traefik is updated to read "Caddy (ADR-024)".
- M4a (public hosts, HTTP-01) runs **vanilla `caddy:2`** — no custom image. The DNS-01
custom Caddy image (`xcaddy` + `caddy-dns/gandi`, `.docker/caddy-gandi/`) is **built and
proven**; it must be pushed to the Forgejo registry (`make caddy-image-push`, needs
`docker login`) and kept current (plugin + base-image version bumps, pinned per
ADR-011/ADR-014) as DNS-01 cluster services come online.
- Caddyfile config is rendered by Ansible from `group_vars` — consistent with ADR-004
and easier to review than distributed container labels.
- `forward_auth` to Authentik is available when Authentik is deployed; no extra
middleware layer required.
- The `proxy` concern tag (already in `tests/tags.yml`) covers Caddy config tasks.
## What was ruled out
- **Traefik** — dynamic label discovery is a mismatch for boma's catalog-rendered
config model (ADR-004); more complex for a solo operator; no prior investment to
protect.
- **nginx / HAProxy** — no built-in ACME; require a separate ACME client (certbot,
acme.sh) adding operational surface; Caddy's integrated ACME is simpler.
- **NetBird's bundled TLS** — NetBird's management UI can serve its own TLS, but that
doesn't generalise; a real proxy separates concerns and applies to every service.
## Related
- ADR-002 — services behind a proxy with authentication (the requirement this satisfies).
- ADR-004 — Docker & Compose model (template-rendered config, catalog-driven).
- ADR-007 / M1 — Gandi DNS-01 ACME path (the TLS strategy Caddy implements).
- ADR-016 — NetBird (M4 is the first deployment of this proxy).
- ADR-017 — service-UI verification; forward_auth to Authentik is the future auth story.

@ -0,0 +1,180 @@
# ADR-025 — Local VM integration testing on ubongo
## Status
Accepted (2026-06-18). Implements ADR-008 Level 2/3 (deferred for lack of hosts; now
viable on ubongo). **RED→GREEN acceptance PASSED on real hardware (2026-06-18):** a
throwaway KVM VM on ubongo reproduced the 2026-06-17 incident (base's nftables forward
default-deny kills Docker forwarding on reboot) — RED — and survived the reboot once
the `docker_host` container-forward drop-in was applied — GREEN. Two shakedown
learnings added below.
## Context
Molecule (ADR-008 Level 1) tests each role in a single Docker container: one
`converge`, no real kernel netfilter, no real Docker daemon in the loop, and **no
reboot**. That structurally cannot catch an entire class of bug — reboot-survivability,
host-firewall × Docker interaction, and boot-ordering — which is exactly the class
that caused the **2026-06-17 mesh-hardening incident**.
During that incident, `base`'s nftables `forward { policy drop; }` killed the askari
Docker host **on reboot**: nftables loaded its default-deny before Docker, breaking
published-port DNAT and inter-container forwarding. Public services and the mesh went
down. It had worked right after `make deploy`, when Docker's runtime rules still
coexisted. `ip_nonlocal_bind` also failed to beat the sshd boot-race, leaving the mesh
listener absent at boot. Recovery required the Hetzner console and a WAN-SSH
break-glass. Molecule had passed.
ADR-008's Level 2/3 was deferred "for lack of hosts." ubongo breaks that deferral:
> verified: ubongo KVM capability · Bash (2026-06-18 session) · `/dev/kvm` present +
> accessible (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM
> free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant **not yet installed** ·
> 2026-06-18.
## Decision
### 1. Virtualisation approach: libvirt/KVM directly (Approach A)
A golden Debian-13 genericcloud qcow2 is cached locally on ubongo. Each run boots an
ephemeral qcow2 **overlay** backed by it (the golden image is never mutated), seeded
via cloud-init NoCloud, driven by a **stdlib-only** Python driver
(`scripts/integration-vm.py`) over `virsh` / `virt-install` / `cloud-localds`. There is
no `libvirt-python` dependency, so the driver stays portable and the role stays lean.
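
A minimal Python sketch of the overlay idea. It assumes the overlay is created with `qemu-img`, which this ADR does not name (it lists only `virsh`, `virt-install`, and `cloud-localds`), and the helper name and paths are hypothetical:

```python
def overlay_argv(golden: str, overlay: str) -> list[str]:
    """Build a `qemu-img create` argv for a qcow2 overlay backed by `golden`.

    Writes made by the VM land in `overlay`; the backing file is only read,
    so the cached golden image is never mutated.
    """
    return [
        "qemu-img", "create",
        "-f", "qcow2",   # format of the new overlay
        "-b", golden,    # backing file (the cached golden image)
        "-F", "qcow2",   # format of the backing file
        overlay,
    ]
```

Deleting the overlay file after a run is then enough to return to a pristine image, which is what makes each VM cheap to throw away.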
### 2. Fidelity envelope
The bugs are **post-boot**, not in the provisioning path. A lightweight local hypervisor
is sufficient: real OS, real kernel netfilter, real Docker daemon, real published-port
DNAT, a **real reboot**, and the coordinator running inside the VM (so the VM forms its
own one-node mesh, reproducing the circular bootstrap). The Proxmox provisioning chrome
is not mirrored.
### 3. Scope: one throwaway VM at a time, instantiated from real inventory
The first profile is **"be askari"** — a single box running Docker host + NetBird
coordinator + mesh peer, mirroring the host whose incident motivates this work. The
mechanism is generic: swap the profile to "be" any inventory host. Multi-VM topologies
are a deferred extension.
### 4. Acceptance: self-validating against the real failure
The harness is accepted when it can, on a local VM:
1. Apply `base` (firewall on, no `docker_host` container-forward drop-in) to a Docker
host, reboot, and observe the **2026-06-17 breakage** (Docker forwarding dead,
services down). If the host instead comes back healthy, the harness is not faithful.
2. Apply the `docker_host` container-forward fix, re-run, and **survive the reboot**.
### 5. Tiered cert fidelity via a `--certs` knob
DNS-01 is what makes real certs possible without public inbound (validation is
out-of-band via a Gandi TXT record; the VM needs only outbound to ACME + Gandi, which
the isolated NAT network provides):
| Tier | Description | Default? |
|---|---|---|
| `internal` | Caddy `tls internal` — zero deps, instant. For incident repro and runs where certs are not under test. | Yes |
| `le-staging` | Real DNS-01 ACME against Let's Encrypt **staging** — real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. | Built in v1; use when testing the ACME/cert path. |
| `le-prod-wildcard` | A real trusted `*.test.wingu.me` wildcard, **issued once, persisted on ubongo, reused** across runs. | On-demand only. Accepted risk recorded as R6 in `docs/security/accepted-risks.md`. |
A deliberate "no-egress" failure scenario (reproducing FRICTION 2026-06-17 #4:
`netbird-server` FATAL-loops on GeoLite2 download when egress is lost) forces
`internal`, since ACME requires egress.
### 6. The toolchain is Ansible-managed
A new non-service role (`integration_test`, `control` group) installs and enables
libvirt + QEMU + virtinst reproducibly. The driver manages the golden image lazily on
first run (keeping the role lean; no fiddly download/refresh logic in Ansible). The
repo owns ubongo's state.
### 7. Stubs live in an overlay file, never in the real inventory
Transient inventory entries for the test VM are generated at runtime as a single-host
file. Stubs (cert tier, in-VM coordinator endpoint, VM connection details) live in
`tests/integration/overrides/<host>.yml` — an explicit, reviewable overlay. The real
inventory is never touched, so `make tf-inventory` and "don't edit inventory directly"
stay intact.
## Consequences
- **Reconciles ADR-015:** ubongo runs ephemeral KVM test VMs as part of its
local-test-runner role — it is still not a production hypervisor. A default VM
(~2 vCPU / 3 GiB / 20 GiB thin overlay) against ~13 GiB free is comfortable; the
driver enforces **one integration VM at a time** (resource guard, name-prefix
`boma-it-*`) and refuses to start below a free-RAM threshold.
- **Operationalises the standing rule:** "firewall/sshd/boot changes must be tested on
a real VM with a real reboot before they touch a live host" (FRICTION 2026-06-17 #6)
becomes a concrete, runnable step documented in `docs/runbooks/integration-testing.md`.
- **Accepted risk R6:** `le-prod-wildcard` runs pass the production Gandi PAT
(`vault.gandi.pat`) to an ephemeral local VM and write transient `_acme-challenge`
TXT records into the real `wingu.me` zone. Scope: on-demand only; `internal` is the
default tier and `le-staging` covers routine ACME testing. Compensating controls:
ephemeral VM, isolated NAT network, TXT records
auto-removed by Caddy after validation.
- **Three safety invariants** make the test tool itself safe:
1. The transient inventory contains only the test VM — no real host is ever in scope.
2. "Be askari" points NetBird at the in-VM coordinator — the VM forms its own one-node
mesh; it never enrols in the real mesh.
3. Test VMs sit on an isolated libvirt NAT network — outbound NAT for ACME/image pulls
only, not reachable to the LAN (`10.20.x`) or the real mesh.
- **Diagnostics on failure** (catching a bug is the point): failure keeps the VM and
dumps `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`,
`systemd-analyze critical-chain`. `make test-integration-clean` reaps all `boma-it-*`
orphans. Diagnostics land in gitignored `~/integration-runs/<ts>-<host>/`.
- **Future pinch:** concurrency with the Level-4 Chromium/Playwright stack (ADR-017)
competes for ubongo RAM. The resource guard is the v1 answer — one integration VM at a
time; don't run alongside a heavy Level-4 session. Revisit at `/capacity-review`.
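
The resource guard described above can be sketched as a pure function. This is a minimal Python illustration, not the code in `scripts/integration-vm.py`: the 4096 MiB threshold is a hypothetical value (the ADR does not state one), and the caller is assumed to supply the text of `/proc/meminfo` and the list of running libvirt domain names.

```python
def mem_available_mib(meminfo_text: str) -> int:
    """Extract MemAvailable from /proc/meminfo text and convert kB to MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found")


def may_start_vm(meminfo_text: str, running_domains: list[str],
                 min_free_mib: int = 4096, prefix: str = "boma-it-"):
    """Return (allowed, reason) for starting one more integration VM."""
    # One integration VM at a time, identified by the name prefix.
    if any(name.startswith(prefix) for name in running_domains):
        return False, "an integration VM is already running"
    free = mem_available_mib(meminfo_text)
    if free < min_free_mib:
        return False, f"only {free} MiB available, need {min_free_mib}"
    return True, "ok"
```

Keeping the guard free of side effects means it can be exercised against canned `/proc/meminfo` text without touching libvirt.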
## Scope
**In scope:** reboot-survivability, host-firewall × Docker interaction, boot-ordering,
cert/ACME paths, mesh bootstrap on one box.
**Out of scope (v1):** multi-VM mini-cluster (inter-host mesh dataplane); CI gate
(this is an interactive, agent-driven pre-deploy check; CI stays lint + Molecule per
ADR-008/010); the Proxmox provisioning path (the bugs live in the boot/kernel/Docker
layer, not provisioning).
## What was ruled out
| Option | Reason |
|---|---|
| **Proxmox VE nested on ubongo** | Highest fidelity including the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs do not live in provisioning. |
| **Vagrant + vagrant-libvirt** | Mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin; boxes drift from the real Debian cloud image; the reboot→assert sequence still needs custom logic. |
| **terraform-provider-libvirt** | Declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence; adds throwaway state; blurs ADR-006's "TF owns *production* VM existence on Proxmox" boundary. |
## Verified facts (ADR-014)
- verified: ubongo KVM capability · Bash · `/dev/kvm` present + accessible (kvm group),
Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB
disk free · 2026-06-18.
## Shakedown learnings (2026-06-18 live run)
Two findings from the RED→GREEN acceptance run that affect anyone operating the harness:
1. **Boot firmware: UEFI required.** The Debian 13 genericcloud image triple-faults
under legacy BIOS/SeaBIOS and does not reach the kernel. Boot the VM with UEFI
(`virt-install --boot uefi`; `ovmf` package). The driver does this by default; note
it here so the requirement is findable.
2. **`claude` sudo is load-bearing.** VM management (`virsh`, `virt-install`,
`cloud-localds`) and offline diagnostics (`nft list ruleset`, `journalctl -b`,
`systemd-analyze critical-chain`) all require root. The harness assumes the
AI-worker has `NOPASSWD:ALL` sudo on `ubongo` — settled as the ADR-015 amendment
(2026-06-18) and registered as R7 in `docs/security/accepted-risks.md`. A `claude`
account without sudo will block the harness at the first `virsh` call.
The full set of nine shakedown findings (including the UEFI boot-loop) is in
`docs/FRICTION.md`.
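Because a `claude` account without sudo only fails at the first `virsh` call, a preflight check can surface the problem earlier. The sketch below is an illustration under assumptions, not part of the harness: the function name is hypothetical, and it relies only on `sudo -n` (non-interactive mode), which exits non-zero instead of prompting when a password would be required.

```python
import subprocess


def has_passwordless_sudo(run=subprocess.run):
    """Return True when `sudo -n true` succeeds without prompting.

    `run` is injectable so the check can be exercised without real sudo.
    """
    try:
        proc = run(["sudo", "-n", "true"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        # sudo missing or hung: treat as "not available".
        return False
    return proc.returncode == 0
```

Calling this before any VM management step would let a driver stop with a clear message rather than a mid-run `virsh` permission error.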
## Related
- ADR-006 — Terraform owns production VM existence (boundary this ADR respects).
- ADR-008 — Testing methodology (Levels 1–4); this ADR is the concrete build of Level 2/3.
- ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs. **Supersedes** ADR-015's "no local sudo" sub-decision for the AI-worker — the shakedown necessitated `claude` NOPASSWD sudo (ADR-023 §4; access model in ADR-021, risk R7).
- ADR-016 — Mesh VPN; the "be askari" profile includes the coordinator role.
- ADR-020 — Firewall strategy; firewall × Docker interaction is what this harness tests.
- ADR-021 — Operational access; sudo model for `claude` and `sjat` on `ubongo`.
- ADR-024 — Reverse proxy (Caddy); cert tiers exercise the DNS-01 ACME path.


@ -0,0 +1,40 @@
# ADR-NNN — <Title>: <optional clarifying subtitle>
<!-- Filename: NNN-kebab-title.md (zero-padded, monotonic, never reused).
Register a row in CLAUDE.md "Further reading" when this ADR is created.
Sections below in order. Mandatory: Status, Context, Decision, Consequences.
Delete this comment and any optional section you don't use. -->
## Status
Accepted (YYYY-MM-DD)
<!-- Lifecycle: usually born "Accepted (YYYY-MM-DD)"; use "Proposed (YYYY-MM-DD)" for a
genuine draft (open questions), promoted to Accepted once settled. Later:
"Superseded by ADR-NNN (YYYY-MM-DD)" or "Deprecated (YYYY-MM-DD)" + one-line why.
Optional trailing note OK, e.g.
"Accepted (2026-06-10). Doctrine ADR — pins policy, builds nothing yet." -->
## Context
<!-- The forces, the problem, what exists today, why now. -->
## Decision
<!-- What we are doing. Use numbered sub-decisions (### 1. ...) for multi-part ADRs. -->
## Consequences
<!-- Results, trade-offs explicitly accepted, follow-on work. -->
<!-- Optional sections — uncomment any that genuinely apply; never pad:
## Scope — explicit in / out-of-scope boundaries.
## Guardrails — how the decision is mechanically enforced (lint, CI, hooks).
## What was ruled out — rejected alternatives, each with its reason.
## Verified facts (ADR-014) — verified: <subject> · <tool> <version> · <source> · <YYYY-MM-DD>
## Related — links to other ADRs by number; bidirectional for Supersedes/Superseded-by.
-->


@ -18,6 +18,25 @@
- **NICs:** _eno1 trunk (vmbr0), eno2 corosync (vmbr1)_
- **Notes:** _warranty, quirks_
### ubongo (control node — outside the cluster)
- **Model / form factor:** Lenovo ThinkCentre M70q Tiny (machine type 11DUS7XP00); 1-litre tiny/USFF
- **CPU:** Intel Core i3-10100T — 4 cores / 8 threads, 35 W TDP
- **RAM:** 16 GB DDR4-3200 (2×8 GB SODIMM)
- **Storage:** 256 GB SanDisk X600 SATA 2.5" SSD (model SD9TB8W256G1001; TCG Opal-capable, Opal unused — no disk encryption)
- **NICs:** wired GbE, interface eno1, MAC 88:a4:c2:e0:ee:da
- **BIOS:** Lenovo M2WKT5AA (2023-06-20)
- **Notes:** always-on; control plane + AI-worker (dedicated `claude` user) + local test runner (Molecule/Docker) per ADR-015; not a Proxmox guest; remote access currently LAN SSH only (mesh deferred). Also runs **one ephemeral KVM integration test VM** (~3 GiB RAM) at a time per ADR-025 — the resource guard enforces one-at-a-time; do not run a test-integration cycle alongside a heavy Level-4 browser session (Chromium/Playwright).
### fisi (backup node — outside the cluster; provisional)
- **Model / form factor:** HP Elite 600 G9 (tower)
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
- **RAM:** 16 GB+ (TBD exact)
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk)
- **NICs:** wired GbE
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud,
docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs.
Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).
_(repeat for pve1, pve2, askari)_
## 2. Network gear
@ -46,6 +65,8 @@ Physical totals per node. Integers; `ram_gb` and `disk_gb` may be decimals.
|------|-------|--------|---------|
| pve0 | 20 | 64 | 4000 |
| pve1 | 20 | 64 | 4000 |
| ubongo | 4 | 16 | 250 |
| fisi | 4 | 16 | 8000 |
## 5. Capacity notes


@ -0,0 +1,88 @@
{
"date": "2026-06-05",
"reviewed_commit": "f566fd1",
"fixes_commit": "666ad42",
"mode": "on-demand",
"counts": {
"auto_fixed": 4,
"open": 12,
"scan": {"broken-path-ref": 14, "marker": 35, "open-deferred-item": 6, "stale-deferred": 0}
},
"auto_fixed": [
{"id": "AF1", "dimension": "consistency", "severity": "high",
"location": "docs/decisions/005-bootstrapping.md:36; docs/runbooks/new-host.md:62,71",
"description": "Terraform 'writes the host's DNS A record' contradicts ADR-009 (dns role owns the zone)",
"fix": "removed the DNS-write clause; noted Terraform writes no DNS records",
"tag": "recurring"},
{"id": "AF2", "dimension": "consistency", "severity": "high",
"location": "docs/decisions/005-bootstrapping.md:8",
"description": "control node described as cloned from the cloud-init template; ADR-015 makes ubongo physical",
"fix": "control node is a physical box installed directly, not cloned (ADR-015)",
"tag": "new"},
{"id": "AF3", "dimension": "consistency", "severity": "low",
"location": "CLAUDE.md:197",
"description": "Further reading missing the VERIFY.md template row",
"fix": "added docs/testing/service-verify-template.md row",
"tag": "new"},
{"id": "AF4", "dimension": "cruft", "severity": "low",
"location": "docs/TODO.md:79",
"description": "typos: 'we we', 'seperate'",
"fix": "corrected to 'we' and 'separate'",
"tag": "new"}
],
"open": [
{"id": "O1", "dimension": "consistency", "severity": "medium",
"location": "docs/decisions/004-docker-model.md",
"description": "service-role standard file table lists SECURITY.md but not VERIFY.md (ADR-017/CLAUDE.md:85 mandate it)",
"suggested_fix": "add a VERIFY.md row to ADR-004's file table", "tag": "new"},
{"id": "O2", "dimension": "consistency", "severity": "medium",
"location": "docs/runbooks/new-role.md",
"description": "no step to write VERIFY.md for service roles; STATUS.md:17 'runbooks reconciled' now overstated",
"suggested_fix": "add a VERIFY.md step mirroring the SECURITY.md step", "tag": "new"},
{"id": "O3", "dimension": "cruft", "severity": "low",
"location": "README.md:58-60,94",
"description": "ADR list stops at 001-009; docs/ tree omits security/, testing/, hardware/",
"suggested_fix": "extend ADR list + docs/ subtree", "tag": "new"},
{"id": "O4", "dimension": "consistency", "severity": "medium",
"location": "CLAUDE.md:106; docs/decisions/009-provisioning-handoff.md:78; scripts/tf_to_inventory.py:24",
"description": "ADR-016 says askari gets its own inventory group but none is named; valid-groups set excludes it",
"suggested_fix": "name the group; add to host-groups + ADR-009 valid groups", "tag": "new"},
{"id": "O5", "dimension": "consistency", "severity": "medium",
"location": "docs/decisions/006-terraform.md:78",
"description": "backend.tf labelled 'Forgejo state backend' contradicts ADR-006's own local-state section",
"suggested_fix": "relabel to local state backend (no remote backend)", "tag": "new"},
{"id": "O6", "dimension": "drift", "severity": "medium",
"location": "docs/decisions/014-knowledge-sourcing.md:88",
"description": "plugin reproducibility described as open, but TODO 10.7 is DONE",
"suggested_fix": "update to resolved state; drop the forward-pointer", "tag": "new"},
{"id": "O7", "dimension": "consistency", "severity": "low",
"location": "docs/decisions/011-update-management.md:128",
"description": "ruled-out 'Digest-pinning the stateful tier' contradicts Decision #2 (adopts tag@digest); ADR-011 is draft",
"suggested_fix": "remove/replace the ruled-out row when accepting ADR-011 (TODO 16)", "tag": "new"},
{"id": "O8", "dimension": "consistency", "severity": "low",
"location": "docs/decisions/003-toolchain.md:85; docs/decisions/010-forgejo-ci.md:66",
"description": "'act_runner on control node or a dedicated runner VM' ambiguous vs ADR-015",
"suggested_fix": "name ubongo as runner host; cross-ref ADR-015", "tag": "new"},
{"id": "O9", "dimension": "consistency", "severity": "low",
"location": "docs/decisions/008-testing.md:148",
"description": "WireGuard Molecule-exclusion row framed for retired OPNsense VLAN-99 WireGuard",
"suggested_fix": "reframe to NetBird wt0 data plane (ADR-016)", "tag": "new"},
{"id": "O10", "dimension": "consistency", "severity": "low",
"location": "docs/decisions/011-update-management.md:67",
"description": "cross-refs 'scheduled_jobs plan and ADR-010'; ADR-010 has no such plan (TODO 8.3)",
"suggested_fix": "point to TODO 8.3", "tag": "new"},
{"id": "O11", "dimension": "consistency", "severity": "low",
"location": "docs/CAPABILITIES.md",
"description": "no row for the /verify-service (Level 4) capability decided in ADR-017",
"suggested_fix": "add an Operations row for /verify-service", "tag": "new"},
{"id": "O12", "dimension": "cruft", "severity": "low",
"location": "docs/TODO.md:30",
"description": "item 3.10 is garbled/unfollowable",
"suggested_fix": "rewrite clearly or strike", "tag": "new"}
],
"scan_noise": [
"broken-path-ref x14: illustrative report-name templates (YYYY-MM-DD-<service>.md) and not-yet-created latest.md files; scanner stops at the <placeholder> boundary",
"marker x35: mostly prose references to TODO.md items, not code markers",
"open-deferred-item x6: all confirmed genuinely open (ADR-011 #1-5, ADR-015 #3); 0 stale-deferred"
]
}


@ -0,0 +1,93 @@
# Repo review — 2026-06-05
- **Reviewed commit:** `f566fd1` (scan); auto-fixes landed in `666ad42`
- **Mode:** on-demand (interactive)
- **Scope:** whole repo — 2 roles, 17 ADRs, 4 runbooks, 7 scripts; doc-heavy
- **Prior run:** 2026-05-30 (`de38d1c`) — 7 auto-fixed, 17 open
## Summary
| | high | medium | low | total |
|---|---|---|---|---|
| Auto-fixed | 2 | 0 | 2 | 4 |
| Open (report-only) | 0 | 5 | 7 | 12 |
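The rows of this summary can be derived from the companion JSON report, whose `auto_fixed` and `open` lists carry a `severity` field per finding. A minimal sketch of that tally follows; the helper name `severity_row` is hypothetical and this is not necessarily how the review tooling builds the table.

```python
from collections import Counter

SEVERITIES = ("high", "medium", "low")


def severity_row(findings):
    """Return [high, medium, low, total] for a list of finding dicts."""
    counts = Counter(finding["severity"] for finding in findings)
    return [counts.get(level, 0) for level in SEVERITIES] + [len(findings)]
```

Feeding it the report's `auto_fixed` list (two high, two low) gives the first row; the `open` list gives the second.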
This review followed a session of heavy documentation work (ADR-015 `ubongo`,
ADR-016 NetBird mesh, ADR-017 Level-4 verification). Most findings are **propagation
gaps** — a new decision landed but an older doc still reflects the prior design.
**New deferral check exercised.** `repo-scan.py` now enumerates open ADR "Deferred/
Open" items and flags any that another file calls resolved but that are still unmarked.
This run: 6 open-deferred-items surfaced, **all confirmed genuinely open** by the
cross-cutting reviewer (ADR-011 #1–5, ADR-015 #3), **0 stale-deferred**. The check
produced no false resolutions and the judgement layer agreed; it is working as designed.
## Auto-fixes applied (`666ad42`)
| id | dim | sev | location | fix |
|---|---|---|---|---|
| AF1 | consistency | high | `docs/decisions/005-bootstrapping.md:36`, `docs/runbooks/new-host.md:62,71` | Removed "Terraform writes the host's DNS A record" — contradicts ADR-009 (the `dns` role owns the zone). **Recurring**: the 2026-05-30 run fixed the same contradiction in README/ADR-003; it reappeared in two more files. |
| AF2 | consistency | high | `docs/decisions/005-bootstrapping.md:8` | Control node described as cloned from the cloud-init template; ADR-015 makes `ubongo` a physical box installed directly. Corrected. |
| AF3 | consistency | low | `CLAUDE.md:197` | Added the missing `docs/testing/service-verify-template.md` row to Further reading (parallels the security-template row). |
| AF4 | cruft | low | `docs/TODO.md:79` | Typos: "we we" → "we"; "seperate" → "separate". |
## Open findings (report-only)
### VERIFY.md propagation cluster (ADR-017 not fully threaded through)
| id | sev | location | finding | suggested fix |
|---|---|---|---|---|
| O1 | medium | `docs/decisions/004-docker-model.md` (file table) | The service-role standard lists `SECURITY.md` but not `VERIFY.md`, though ADR-017 + CLAUDE.md:85 now mandate it. | Add a `VERIFY.md` row to ADR-004's file table. |
| O2 | medium | `docs/runbooks/new-role.md` (step 9 → Commit) | No step to write `VERIFY.md` for service roles (only `SECURITY.md`). Makes `STATUS.md:17` ("runbooks current and mutually reconciled") slightly overstated. | Add a "write the per-service verification spec" step mirroring the SECURITY.md step. |
| O3 | low | `README.md:58-60, 94` | ADR list stops at 001–009 (010–017 absent); the `docs/` tree omits `security/`, `testing/`, `hardware/`. | Extend the ADR list (or point to `docs/decisions/` + CLAUDE.md's table); expand the `docs/` subtree. |
### Design gaps from the recent ADRs
| id | sev | location | finding | suggested fix |
|---|---|---|---|---|
| O4 | medium | `CLAUDE.md:106`, `docs/decisions/009-provisioning-handoff.md:78`, `scripts/tf_to_inventory.py:24` | ADR-016 says "`askari` is Ansible-managed — its own inventory group", but no group is named anywhere; host-groups list + valid-groups set don't include it. | Decide the group name (e.g. `edge_hosts`/`hetzner_hosts`), add to CLAUDE.md host groups + ADR-009 valid groups. (`askari` is manual like the control node, so `tf_to_inventory.py` need not generate it, but the group must be valid.) |
| O5 | medium | `docs/decisions/006-terraform.md:78` | `backend.tf` labelled "Forgejo state backend", contradicting ADR-006's own State-backend section (local state on `ubongo`; Forgejo's API is read-only). | Relabel to "local state backend (no remote backend)". |
| O6 | medium | `docs/decisions/014-knowledge-sourcing.md:88` | Plugin-reproducibility described as open ("tracked in `docs/TODO.md`"), but TODO 10.7 is marked DONE (settings.json declares the plugin set; claude-code-setup.md covers bootstrap). | Update to reflect the resolved state; drop the forward-pointer. |
### Clarity / lower-priority consistency
| id | sev | location | finding | suggested fix |
|---|---|---|---|---|
| O7 | low | `docs/decisions/011-update-management.md:128` | "Digest-pinning the stateful tier" sits in the ruled-out table, but Decision #2 *adopts* `tag@digest` for stateful (TODO 16 confirms). ADR-011 is still **Proposed/draft**. | Remove/replace the ruled-out row when accepting ADR-011 (TODO 16). |
| O8 | low | `docs/decisions/003-toolchain.md:85`, `docs/decisions/010-forgejo-ci.md:66` | "act_runner on the control node **or a dedicated runner VM**" reads ambiguously against ADR-015 (no cluster control VM). Not wrong (a runner VM is a separate option) but worth disambiguating. | Name `ubongo` as the runner host; cross-ref ADR-015; keep "dedicated runner VM" as an explicit future option. |
| O9 | low | `docs/decisions/008-testing.md:148` | The "WireGuard tunnel establishment" Molecule-exclusion row is framed for the retired OPNsense VLAN-99 WireGuard; NetBird still uses WireGuard (`wt0`) as its data plane. | Reframe the row to the NetBird `wt0` data-plane (ADR-016). |
| O10 | low | `docs/decisions/011-update-management.md:67` | Cross-references "the `scheduled_jobs` plan and ADR-010"; ADR-010 is Forgejo CI, not scheduled jobs (that's TODO 8.3, unbuilt). | Point to TODO 8.3 instead. |
| O11 | low | `docs/CAPABILITIES.md` §10 | No row for the `/verify-service` (Level 4) capability though ADR-017 decided it. | Add an Operations row for `/verify-service`. |
| O12 | low | `docs/TODO.md:30` (item 3.10) | Garbled text ("maybe something in the improvements of the methods in boma moods the point?") — unfollowable. | Rewrite the question clearly or strike it. |
### Deterministic-scan noise (not fixed — known limitations)
- **`broken-path-ref` ×14** — all illustrative/future paths: report-name templates
(`docs/testing/reviews/YYYY-MM-DD-<service>.md`) and `latest.md` files not yet
created. The path-ref check stops at the `<placeholder>` boundary, so a templated
path registers as a partial broken ref. *Potential scanner improvement: skip a path
ref immediately followed by a placeholder char or a `YYYY-MM-DD` token.*
- **`marker` ×35** — mostly prose references to `TODO.md` items, not code markers.
Known noise; the regex already excludes `TODO.md`/alternations but not "TODO 8.2"
prose.
- **`open-deferred-item` ×6** — all confirmed genuinely open (see above). `0`
stale-deferred. New check healthy.
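The scanner improvement suggested for `broken-path-ref` could look roughly like the sketch below. It is an illustration under assumptions: the `PATH_REF` pattern, the set of placeholder characters, and the function name are hypothetical, since the real `repo-scan.py` regex is not shown in this report. It also treats a date token anywhere inside the matched ref as templated, which is one reading of the suggestion.

```python
import re

# Hypothetical path-ref pattern; the real scanner's regex may differ.
PATH_REF = re.compile(r"(?:docs|roles|scripts|playbooks)/[\w./-]+")
PLACEHOLDER_CHARS = "<{"
DATE_TOKEN = "YYYY-MM-DD"


def candidate_path_refs(text):
    """Return path refs worth checking on disk, skipping templated ones."""
    refs = []
    for match in PATH_REF.finditer(text):
        ref = match.group(0)
        next_char = text[match.end():match.end() + 1]
        # The match stopped at a placeholder boundary, e.g. "<service>".
        if next_char and next_char in PLACEHOLDER_CHARS:
            continue
        # Report-name templates carry a literal date token.
        if DATE_TOKEN in ref:
            continue
        refs.append(ref.rstrip("."))  # drop sentence-final punctuation
    return refs
```

With this filter, `docs/testing/reviews/YYYY-MM-DD-<service>.md` would no longer register as a partial broken ref, while ordinary refs such as `docs/TODO.md` are still returned for checking.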
## Diff vs prior run (2026-05-30)
- **Recurring:** the Terraform-writes-DNS contradiction (AF1) — fixed in README/ADR-003
last run, reappeared in ADR-005/new-host.md. Signal that this phrasing keeps being
copied; worth a `/review-repo`-time grep for "writes … DNS A record".
- **New:** everything else — the repo gained ADR-010…017 and the `ubongo`/NetBird/
Level-4 work since the prior run, so most findings are fresh propagation gaps.
- **Resolved:** prior-run open items were largely addressed during the intervening
doc work (control-node-as-VM, WireGuard framing, etc., now mostly reconciled).
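The review-time grep suggested for the recurring DNS contradiction can be sketched as below. This is a hedged example, not an existing `/review-repo` step: the function name, the 60-character gap allowed between "writes" and "DNS A record", and the restriction to `*.md` files are all assumptions.

```python
import re
from pathlib import Path

# "writes", then up to 60 characters on the same line, then "DNS A record".
DNS_WRITE_CLAIM = re.compile(r"writes\b.{0,60}?\bDNS A record", re.IGNORECASE)


def find_dns_write_claims(root):
    """Return (relative path, line number, line) for each suspect line."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            if DNS_WRITE_CLAIM.search(line):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

Two limits are worth noting: the search is per line, so a phrase wrapped across two lines is missed, and review reports that quote the phrase (such as this one) will show up as hits and need to be ignored by the reader.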
## Follow-up prompt
> Thread the ADR-017 `VERIFY.md` convention through the remaining docs (O1–O3): add a
> `VERIFY.md` row to ADR-004's service-role file table, a VERIFY.md step to
> `new-role.md` (and reconcile STATUS.md:17), and refresh `README.md`'s ADR list +
> `docs/` tree. Then settle the `askari` inventory group name (O4) and propagate it to
> CLAUDE.md host-groups + ADR-009 valid-groups. Finally clear the stale labels O5
> (ADR-006 backend.tf) and O6 (ADR-014 plugin reproducibility = DONE).


@ -0,0 +1,65 @@
{
"date": "2026-06-11",
"reviewed_commit": "67f2aba",
"fixes_commit": null,
"mode": "on-demand",
"counts": {
"auto_fixed": 5,
"open": 18,
"scan": {
"broken-adr-ref": 4,
"broken-path-ref": 1,
"marker": 14,
"open-deferred-item": 5,
"stale-deferred": 0
}
},
"deferral_checklist": {
"adr-011-open-items": "all 5 (snapshot driver, cadences, health-check harness home, classification home, staging-first) confirmed genuinely still open; cross-checked against later ADRs + TODO 16. No stale-deferred.",
"adr-015-deferred": "deferred #1 (mesh VPN) #2 (service-UI) #3 (build) all confirmed marked RESOLVED in place. No stale-deferred.",
"stale_deferred_found": 0
},
"scan_false_positives": [
{"check": "broken-path-ref", "location": "STATUS.md:38", "why": "STATUS legitimately documents roles/docker_host/ as 'Not in git.' — intentional reference to an unbuilt role."},
{"check": "broken-adr-ref", "location": "tests/test_repo_scan.py:10,43; docs/superpowers/plans/2026-06-10-adr-structure.md:50,83", "why": "ADR-099/ADR-100 are intentional test fixtures exercising the scanner's bad-ref detection."},
{"check": "marker", "location": "docs/superpowers/plans/*, docs/superpowers/specs/*, docs/decisions/019-tagging.md:14", "why": "All 14 markers are in historical planning artifacts (commit-message TODOs, plan steps) or prose discussing 'over-tagging' as a concept — not actionable cruft."}
],
"auto_fixed": [
{"id": "AF1", "dimension": "drift", "severity": "high", "location": "roles/README.md:11-13", "description": "'base and docker_host not built yet — empty, untracked dirs, so site.yml would fail on a clean clone' contradicts STATUS.md: base is partially built (firewall concern, tracked), docker_host does not exist, dev_env is built+applied.", "fix": "rewrote Current-state paragraph: base partially built (firewall), docker_host not yet created, dev_env built+applied.", "tag": "new"},
{"id": "AF2", "dimension": "drift", "severity": "medium", "location": "playbooks/site.yml:4-5", "description": "NOTE claimed base + docker_host 'not built yet ... fails on a clean clone'; base's firewall concern is built+applied per STATUS.md.", "fix": "NOTE now states base is partially built (firewall) and only docker_host is missing.", "tag": "new"},
{"id": "AF3", "dimension": "drift", "severity": "medium", "location": "playbooks/README.md:6-8", "description": "site.yml described as 'currently a no-op' (roles empty); base's firewall now applies real nftables state. workstation.yml (applies dev_env) was unlisted.", "fix": "reworded the no-op claim and added a workstation.yml bullet.", "tag": "new"},
{"id": "AF4", "dimension": "drift", "severity": "low", "location": "README.md:58-76", "description": "project-structure tree omitted docs/access/, docs/backup/, roles/dev_env/, and playbooks/workstation.yml — all present on disk.", "fix": "added the four missing tree entries.", "tag": "recurring"},
{"id": "AF5", "dimension": "consistency", "severity": "low", "location": "docs/decisions/016-mesh-vpn.md:110; docs/decisions/020-firewall.md:135", "description": "ADR-021 states it amends ADR-016 and ADR-020 to cross-reference the SSH ladder, but neither listed ADR-021 back in its See-also/Related section.", "fix": "added the reciprocal ADR-021 cross-reference to both.", "tag": "new"}
],
"open": [
{"id": "O1", "dimension": "conformance", "severity": "high", "location": "playbooks/site.yml:18", "description": "`make lint` is RED on `main`: site.yml imports the `docker_host` role which does not exist, so ansible-lint syntax-check fails on a clean checkout. Violates CLAUDE.md 'main must always work' and 'Never skip lint' (pre-commit would block every commit unless bypassed).", "suggested_fix": "Decide an interim posture: guard the docker_host play (e.g. skip until the role exists), stub the role via `make new-role NAME=docker_host`, or exclude site.yml from syntax-check until built — and record it. Judgement call.", "tag": "new", "auto_fixable": false},
{"id": "O2", "dimension": "consistency", "severity": "high", "location": "docs/decisions/004-docker-model.md:105 ↔ docs/decisions/022-backup.md", "description": "ADR-004 'Persistent data' says 'Backup strategy is defined separately (not in scope of this repo).' ADR-022 defines a full in-repo backup strategy (backup role, fisi pull node, per-service backup__* + BACKUP.md). Direct ADR↔ADR contradiction on scope.", "suggested_fix": "Update ADR-004's line to point at ADR-022 (backup is now in-repo scope) and cross-link, per ADR-023's no-silent-reversal rule. Design decision — report only.", "tag": "new", "auto_fixable": false},
{"id": "O3", "dimension": "consistency", "severity": "medium", "location": "docs/decisions/004-docker-model.md:48-49", "description": "ADR-004's service-role file table (the canonical standard) lists only SECURITY.md + VERIFY.md, but CLAUDE.md + ADR-021/ADR-022 now mandate ACCESS.md (every service role) and BACKUP.md (stateful service roles).", "suggested_fix": "Add ACCESS.md (ADR-021) and BACKUP.md (ADR-022) rows to ADR-004's service-role file table. (Prior O1 'missing VERIFY.md' is now resolved — this is the next evolution.)", "tag": "new", "auto_fixable": false},
{"id": "O4", "dimension": "consistency", "severity": "medium", "location": "docs/CAPABILITIES.md:149-154 ↔ STATUS.md:29", "description": "CAPABILITIES lists nvim/tmux/shell config as a CONFIRMED EXCLUSION ('boma is server-only, so these are correctly absent'), but the dev_env role (built+applied to ubongo) installs exactly zsh+oh-my-zsh+tmux+neovim.", "suggested_fix": "Carve out an exception for the control-node developer/AI-worker environment (ubongo, ADR-015) rather than flatly excluding nvim/tmux; distinguish infra worker-host config from personal desktops.", "tag": "new", "auto_fixable": false},
{"id": "O5", "dimension": "drift", "severity": "medium", "location": "docs/decisions/002-security.md:82", "description": "References `make deploy PLAYBOOK=upgrade` as the deliberate full-upgrade mechanism, but no upgrade.yml playbook exists (only bootstrap/site/workstation) and ADR-011 update-management is still Proposed/unbuilt — stated without the '(planned)' caveat ADR-002 uses for its other unbuilt controls.", "suggested_fix": "Add a '(planned — ADR-011, not yet built)' caveat to the upgrade line.", "tag": "new", "auto_fixable": false},
{"id": "O6", "dimension": "drift", "severity": "medium", "location": "inventories/production/hosts.yml:7-16; inventories/staging/hosts.yml:7-14", "description": "Committed hosts.yml stubs omit the offsite_hosts group, but it is one of the four VALID_GROUPS in tf_to_inventory.py and in ADR-009/ADR-016/CLAUDE.md; the next `make tf-inventory` would add it, so the hand-stubs have drifted. (Prior O4 'askari group unnamed' is resolved — naming is now consistent; this is the residual stub gap.)", "suggested_fix": "Regenerate via `make tf-inventory TF_ENV=production` and `TF_ENV=staging` (do NOT hand-edit hosts.yml — CLAUDE.md), or accept the stubs lag until TF runs.", "tag": "new", "auto_fixable": false},
{"id": "O7", "dimension": "drift", "severity": "medium", "location": "docs/runbooks/new-host.md:81-130", "description": "Part E (control node ubongo) instructs creating an 'ansible' user and 'ssh ansible@<IP>', but STATUS.md records ubongo is deliberately managed as the operator account sjat (group_vars/control ansible_user: sjat) with the ansible-user bootstrap listed as Pending.", "suggested_fix": "Update Part E to reflect ubongo managed as sjat (no ansible user yet), ansible-user bootstrap a pending item per STATUS.md.", "tag": "new", "auto_fixable": false},
{"id": "O8", "dimension": "conformance", "severity": "medium", "location": "roles/dev_env/tasks/per_user.yml:2-9", "description": "The getent + `set_fact: dev_env__home` preflight is untagged, but downstream tasks that consume dev_env__home carry concern tags (users, config). A partial `--tags users` or `--tags config` run skips the set_fact, leaving dev_env__home undefined and failing the tagged tasks — against ADR-019's concern-runnable-in-isolation intent.", "suggested_fix": "Tag the preflight with the union of dependent concerns ([users, config]) or `always`.", "tag": "new", "auto_fixable": false},
{"id": "O9", "dimension": "consistency", "severity": "medium", "location": "STATUS.md:31 ↔ docs/decisions/007-network.md", "description": "STATUS places ubongo at 10.20.10.151; ADR-007 defines srv as 10.20.0.0/24 and mgmt as 10.10.0.0/24 — 10.20.10.151 is in neither. base__firewall_control_addr (ADR-021 recovery path) depends on this address being correct. Already a tracked follow-up in the ubongo-build plan (line 147).", "suggested_fix": "Either correct ubongo's recorded address to a valid ADR-007 subnet, or amend ADR-007 to document the actual VLAN/subnet ubongo's physical port lives on, before base__firewall_control_addr is populated.", "tag": "new", "auto_fixable": false},
{"id": "O10", "dimension": "drift", "severity": "low", "location": "README.md:104-106", "description": "README's Documentation ADR list stops at 017; ADRs 018 (logging), 019 (tagging), 020 (firewall), 021 (access), 022 (backup), 023 (ADR structure) exist and are in CLAUDE.md's full table. Partial enumeration is now stale. (Evolved from prior O3, which is otherwise resolved — the docs/ tree omissions were fixed in AF4.)", "suggested_fix": "Extend the list through 023, or trim it to a pointer at CLAUDE.md's full table to avoid a stale partial list.", "tag": "recurring", "auto_fixable": false},
{"id": "O11", "dimension": "conformance", "severity": "low", "location": "docs/decisions/008-testing.md:3; 014-knowledge-sourcing.md:98; 016-mesh-vpn.md:91; 017-service-ui-verification.md:66; 018-logging.md:73", "description": "ADR-023 §2 mandates section order Status→Context→Decision→Consequences. ADR-008 injects a gotchas blockquote before ## Status; ADR-014's ## Decision is a late summary after six topical sections; ADR-016/017/018 place ## Status mid-document. The scan checks presence, not order, so all pass lint — but they don't match the stated standard.", "suggested_fix": "Presentational restructure per ADR-023 §6 (move Status first; pull Decision up). No decision substance changes. Judgement call — report.", "tag": "new", "auto_fixable": false},
{"id": "O12", "dimension": "consistency", "severity": "low", "location": "docs/decisions/007-network.md:160", "description": "The naming-scheme table states the public FQDN convention is `<service>.baobab.band`, but its own example is `forgejo.nyumbani.baobab.band` (extra nyumbani label). The nyumbani split-horizon sub-label is still OPEN (TODO 4); convention and example disagree.", "suggested_fix": "Change the example to forgejo.baobab.band, or note nyumbani is an unresolved split-horizon sub-label (TODO 4). Ties to an open decision — report.", "tag": "new", "auto_fixable": false},
{"id": "O13", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/files/dotfiles/zsh/.zshrc:28,55", "description": "Shipped .zshrc hard-codes `alias rclone=\"/usr/bin/rclone\"` (rclone is not installed by dev_env) and `eval \"$(direnv hook zsh)\"` unguarded (unlike the guarded oh-my-posh block) — heritage fisi/V4 carryovers. If direnv is dropped from dev_env__packages every shell startup errors.", "suggested_fix": "Drop the rclone alias (role doesn't install it) and guard the direnv hook with `command -v direnv`, or document direnv as a hard dependency of the shipped .zshrc.", "tag": "new", "auto_fixable": false},
{"id": "O14", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/tasks/oh_my_posh.yml:15-26", "description": "The zen.toml theme-directory + deploy tasks render config to disk but carry no `config` tag, while analogous dotfile tasks in per_user.yml are tagged `config` — inconsistent concern tagging within the role.", "suggested_fix": "Add tags: [config] to the zen.toml directory + deploy tasks.", "tag": "new", "auto_fixable": false},
{"id": "O15", "dimension": "consistency", "severity": "low", "location": "terraform/environments/production/terraform.tfvars.example:9-11; staging/terraform.tfvars.example", "description": "proxmox_node/endpoint examples use pve01 / pve01.baobab.band, but ADR-007 defines Proxmox node names as pve0/pve1/pve2 (single digit, no leading zero). Example contradicts the naming convention.", "suggested_fix": "Change example values to pve0 / pve0.baobab.band (both envs). Verify the actual node name first — report rather than auto-fix.", "tag": "new", "auto_fixable": false},
{"id": "O16", "dimension": "consistency", "severity": "low", "location": "docs/decisions/013-heritage-v4.md:77; docs/decisions/015-control-host.md", "description": "ADR-013 and ADR-015 close with an inline 'See also:' prose line, whereas ADRs 014/019/020/021/022 and the adr-template use a dedicated `## Related` section. Stylistic inconsistency (## Related is optional per ADR-023 §3).", "suggested_fix": "Convert the 'See also:' prose in ADR-013/015 into ## Related sections for uniformity. Cosmetic.", "tag": "new", "auto_fixable": false},
{"id": "O17", "dimension": "cruft", "severity": "low", "location": "roles/dev_env/handlers/main.yml; roles/base/handlers/main.yml", "description": "Both roles ship an empty handlers/main.yml (only `---`); neither defines or uses handlers (base's firewall apply/rollback is deliberately in tasks). Scaffold artifacts from make new-role.", "suggested_fix": "Confirm whether empty scaffold files are an intentional convention; if not, delete. Low priority.", "tag": "new", "auto_fixable": false},
{"id": "O18", "dimension": "consistency", "severity": "low", "location": "docs/README.md:5-8; inventories/README.md:1-12", "description": "docs/README.md lists only decisions/ + runbooks/ (omits security/testing/access/backup/hardware/reviews/superpowers); inventories/README.md omits the offsite_hosts group documented in CLAUDE.md. Both are narrower than current reality.", "suggested_fix": "Add the missing subdir rows / note offsite_hosts, or explicitly defer to the canonical list. Low priority.", "tag": "new", "auto_fixable": false}
],
"prior_resolved": [
{"id": "O1@2026-06-05", "description": "ADR-004 service-role table missing VERIFY.md row", "status": "resolved — table now lists SECURITY.md + VERIFY.md (next gap ACCESS/BACKUP tracked as O3)"},
{"id": "O2@2026-06-05", "description": "new-role runbook missing VERIFY.md step", "status": "resolved — step 10 present"},
{"id": "O3@2026-06-05", "description": "README ADR list + docs/ tree omissions", "status": "partial — docs tree security/testing/hardware now present; access/backup fixed in AF4; ADR-list staleness carried as O10"},
{"id": "O4@2026-06-05", "description": "askari inventory group unnamed", "status": "resolved — offsite_hosts named consistently (residual stub gap = O6)"},
{"id": "O5@2026-06-05", "description": "backend.tf mislabelled Forgejo state backend", "status": "resolved — now labelled local state"},
{"id": "O6@2026-06-05", "description": "ADR-014 plugin reproducibility described open but TODO done", "status": "resolved"},
{"id": "O11@2026-06-05", "description": "CAPABILITIES missing /verify-service Level-4 row", "status": "resolved — present (§10)"},
{"id": "O12@2026-06-05", "description": "TODO 3.10 garbled", "status": "resolved — readable"},
{"id": "O7-O10@2026-06-05", "description": "ADR-011 digest-pinning row; act_runner ambiguity; WireGuard Molecule row; ADR-011 scheduled_jobs cross-ref", "status": "not re-detected this run (ADR-011 still Proposed) — verify on next run"}
]
}

@ -0,0 +1,161 @@
# Repo review — 2026-06-11
- **Reviewed commit:** `67f2aba` (main)
- **Mode:** on-demand (interactive)
- **Previous run:** `2026-06-05` (commit `f566fd1`)
- **Process:** Phase 0 deterministic scan → 5 parallel shard reviewers + 1 cross-cutting
reviewer → synthesis, deferral-checklist resolution, prior-run diff → safe auto-fixes.
## Summary
| | High | Medium | Low | Total |
|---|---|---|---|---|
| **Auto-fixed** | 1 | 2 | 2 | 5 |
| **Open (report-only)** | 2 | 7 | 9 | 18 |
By dimension (open): conformance 3 · consistency 8 · drift 6 · cruft 1.
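
The per-dimension and per-severity totals can be cross-checked against the findings JSON. The following is a minimal sketch, assuming each finding is a record with `dimension` and `severity` keys as in that file. The `tally` helper is hypothetical, and the sample holds only the six low-severity records O13 to O18, not the full list.

```python
from collections import Counter


def tally(findings, key):
    """Count findings by the given key (e.g. 'dimension' or 'severity')."""
    return Counter(f[key] for f in findings)


# Subset of the open findings (O13-O18), copied from the findings JSON.
sample_open = [
    {"id": "O13", "dimension": "consistency", "severity": "low"},
    {"id": "O14", "dimension": "consistency", "severity": "low"},
    {"id": "O15", "dimension": "consistency", "severity": "low"},
    {"id": "O16", "dimension": "consistency", "severity": "low"},
    {"id": "O17", "dimension": "cruft", "severity": "low"},
    {"id": "O18", "dimension": "consistency", "severity": "low"},
]

by_dimension = tally(sample_open, "dimension")
by_severity = tally(sample_open, "severity")
print(dict(by_dimension), dict(by_severity))
```

Running the same tally over the complete `open` array would give figures directly comparable to the summary table.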
**Headline:** `make lint` is currently **red on `main`**: `playbooks/site.yml` imports the
not-yet-existent `docker_host` role (confirmed at clean HEAD, unrelated to this run's
edits). That breaks CLAUDE.md's "main must always work" / "Never skip lint" contract and
is the top open finding (O1). The bulk of the rest is documentation drift created by the
recent `base` (firewall) + `dev_env` build wave: several READMEs/playbook notes still
described the roles as "empty / not built." Those were the safe auto-fixes.
**Good news:** 7 of the 12 open findings from the 2026-06-05 run are confirmed resolved
(VERIFY.md row + runbook step, backend.tf relabel, askari group naming, ADR-014
reproducibility, CAPABILITIES Level-4 row, TODO 3.10). The deferral checklist is clean —
**0 stale-deferred** this run (the recurring miss logged in FRICTION.md did not recur).
## Auto-fixes applied
Markdown / YAML-comment only; no runtime behaviour, logic, vars, or task order touched.
| ID | Sev | File(s) | What |
|---|---|---|---|
| AF1 | high | `roles/README.md` | Rewrote stale "base & docker_host are empty untracked dirs, site.yml would fail on a clean clone" → base partially built (firewall), docker_host not yet created, dev_env built+applied. |
| AF2 | med | `playbooks/site.yml` | NOTE no longer claims base is unbuilt / "fails on a clean clone"; now reflects firewall-only base + missing docker_host. |
| AF3 | med | `playbooks/README.md` | Dropped the "currently a no-op" claim; added a `workstation.yml` bullet. |
| AF4 | low | `README.md` | Added `docs/access/`, `docs/backup/`, `roles/dev_env/`, `playbooks/workstation.yml` to the project-structure tree. |
| AF5 | low | `docs/decisions/016-mesh-vpn.md`, `docs/decisions/020-firewall.md` | Added the reciprocal `ADR-021` cross-reference that ADR-021 says it amended in. |
> `make lint` was re-run after the fixes: it fails **only** on the pre-existing
> `docker_host` syntax-check (O1), identical to clean HEAD. No auto-fix introduced or
> changed any lint result, so none were reverted.
## Open findings (prioritised)
### High
- **O1 — `make lint` is red on `main`** · `playbooks/site.yml:18` · *conformance*
site.yml imports the `docker_host` role, which does not exist, so ansible-lint's
syntax-check fails on a clean checkout. Violates "main must always work" + "Never skip
lint" (pre-commit would block every commit unless bypassed).
*Fix (judgement):* guard/skip the docker_host play until the role exists, scaffold a
stub via `make new-role NAME=docker_host`, or exclude site.yml from syntax-check until
built — and record the choice. **new**
- **O2 — ADR-004 ↔ ADR-022 backup-scope contradiction** ·
`docs/decisions/004-docker-model.md:105` · *consistency*
ADR-004 says "Backup strategy is defined separately (not in scope of this repo)";
ADR-022 defines a full in-repo backup strategy. Per ADR-023 (no silent reversals),
update ADR-004's line to defer to ADR-022 and cross-link. Design decision — report. **new**
### Medium
- **O3 — ADR-004 service-role file table missing ACCESS.md + BACKUP.md** ·
`docs/decisions/004-docker-model.md:48` · *consistency* — CLAUDE.md + ADR-021/022 now
mandate both for service roles; the canonical table lists only SECURITY.md + VERIFY.md.
(Prior "missing VERIFY.md" is resolved; this is the next evolution.) **new**
- **O4 — CAPABILITIES nvim/tmux exclusion ↔ dev_env built** ·
`docs/CAPABILITIES.md:149` · *consistency* — listed as a confirmed exclusion
("server-only"), but `dev_env` (built+applied to ubongo) installs exactly that. Carve
out the control-node/AI-worker exception (ADR-015). **new**
- **O5 — phantom `make deploy PLAYBOOK=upgrade`** · `docs/decisions/002-security.md:82` ·
*drift* — no `upgrade.yml` exists; ADR-011 is unbuilt. Add a "(planned)" caveat. **new**
- **O6 — hosts.yml stubs missing `offsite_hosts` group** ·
`inventories/{production,staging}/hosts.yml` · *drift* — the generator emits it (one of
four VALID_GROUPS); the hand-stubs predate the standard. Regenerate via
`make tf-inventory` (don't hand-edit). (Prior "askari group unnamed" is resolved.) **new**
- **O7 — new-host runbook Part E vs ubongo reality** · `docs/runbooks/new-host.md:81-130`
· *drift* — instructs creating an `ansible` user / `ssh ansible@`; STATUS records ubongo
is managed as `sjat`, ansible-user bootstrap pending. **new**
- **O8 — dev_env untagged `set_fact` under tagged consumers** ·
`roles/dev_env/tasks/per_user.yml:2-9` · *conformance* — partial `--tags users|config`
runs skip the `dev_env__home` set_fact and fail. Tag the preflight `[users, config]` or
`always`. **new**
- **O9 — ubongo address outside ADR-007 subnets** · `STATUS.md:31 ↔ 007-network.md` ·
*drift* — 10.20.10.151 is in neither srv (10.20.0.0/24) nor mgmt (10.10.0.0/24);
`base__firewall_control_addr` depends on it. Already a tracked follow-up in the
ubongo-build plan. Reconcile address or ADR-007. **new**
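
The subnet mismatch in O9 can be confirmed with a short membership check. This is a sketch using only the address and the two subnets quoted in the finding. The `containing_subnets` helper and the dictionary labels are illustrative and not part of the repo.

```python
import ipaddress

# Subnets as quoted in finding O9 (from ADR-007); the keys are labels only.
SUBNETS = {
    "srv": ipaddress.ip_network("10.20.0.0/24"),
    "mgmt": ipaddress.ip_network("10.10.0.0/24"),
}


def containing_subnets(address: str) -> list[str]:
    """Return the labels of every known subnet that contains the address."""
    ip = ipaddress.ip_address(address)
    return [name for name, net in SUBNETS.items() if ip in net]


ubongo = "10.20.10.151"
matches = containing_subnets(ubongo)
print(matches or f"{ubongo} is in none of {sorted(SUBNETS)}")
```

An empty result means the address sits outside both ranges, which is the drift the finding describes. A `/24` on `10.20.0.0` covers only `10.20.0.0` to `10.20.0.255`.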
### Low
- **O10 — README ADR list stops at 017** · `README.md:104` · *drift* — 018–023 exist;
extend or trim to a pointer. **recurring** (evolved from prior O3)
- **O11 — ADR section-order vs ADR-023 §2** · `008:3, 014:98, 016:91, 017:66, 018:73` ·
*conformance* — Status-not-first / Decision-late; passes lint (order not gated) but not
the standard. Presentational restructure. **new**
- **O12 — ADR-007 FQDN convention vs its own example** · `007-network.md:160` ·
*consistency* — `<service>.baobab.band` vs `forgejo.nyumbani.baobab.band`; ties to open
TODO 4 (split-horizon). **new**
- **O13 — dev_env `.zshrc` heritage carryovers** ·
`roles/dev_env/files/dotfiles/zsh/.zshrc:28,55` · *consistency* — hard-coded
`/usr/bin/rclone` alias (not installed by the role) + unguarded `direnv` hook. **new**
- **O14 — oh_my_posh config tasks untagged** · `roles/dev_env/tasks/oh_my_posh.yml:15-26`
· *consistency* — inconsistent `config` tagging vs per_user.yml. **new**
- **O15 — tfvars.example `pve01` vs ADR-007 `pve0`** ·
`terraform/environments/*/terraform.tfvars.example:9` · *consistency* — verify the real
node name, then align. **new**
- **O16 — ADR-013/015 "See also:" vs `## Related`** · *consistency* — stylistic; convert
for uniformity. **new**
- **O17 — empty scaffold `handlers/main.yml`** · `roles/{dev_env,base}/handlers/main.yml`
· *cruft* — confirm convention or delete. **new**
- **O18 — docs/README.md + inventories/README.md narrower than reality** · *consistency*
— omit several real subdirs / the offsite_hosts group. **new**
## Deferral checklist (Phase 2)
| Source | Items | Verdict |
|---|---|---|
| ADR-011 Deferred/Open | 5 (snapshot driver, cadences, health-check harness home, classification home, staging-first) | **All genuinely still open** — cross-checked against later ADRs + TODO 16. None silently resolved. |
| ADR-015 Deferred | #1 mesh VPN, #2 service-UI, #3 build | **All marked RESOLVED in place** (ADR-016 / ADR-017 / 2026-06-11 build). |
**Stale-deferred found: 0.** The recurring FRICTION.md miss did not recur this run.
## Scan false positives (folded in, not actionable)
- `broken-path-ref STATUS.md:38` — STATUS legitimately documents `roles/docker_host/` as
"Not in git." (intentional reference to an unbuilt role).
- `broken-adr-ref` ×4 — `ADR-099`/`ADR-100` in `tests/test_repo_scan.py` and the
adr-structure plan are intentional **test fixtures** for the scanner's bad-ref check.
- `marker` ×14 — all in `docs/superpowers/{plans,specs}/*` (historical commit-message
TODOs / plan steps) or prose discussing "over-tagging" as a concept. Not cruft.
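
For context on why the `ADR-099`/`ADR-100` fixtures trip the scan, here is a rough sketch of what a bad-ref check could look like. It is not the repo's scanner: the regex, the `broken_adr_refs` helper, and the assumption that ADR-001 to ADR-023 are the existing decisions (the range this review mentions) are all illustrative.

```python
import re

# Matches three-digit ADR references such as "ADR-016".
ADR_REF = re.compile(r"\bADR-(\d{3})\b")


def broken_adr_refs(text: str, known: set[int]) -> list[str]:
    """Return the referenced ADR ids that are not in the known set, sorted."""
    return sorted(
        {f"ADR-{num}" for num in ADR_REF.findall(text) if int(num) not in known}
    )


# Assumption: ADR-001..ADR-023 exist at the reviewed commit.
known_adrs = set(range(1, 24))
sample = "See ADR-016 and ADR-021; the fixtures cite ADR-099 and ADR-100."
print(broken_adr_refs(sample, known_adrs))
```

A check of this shape has no way to tell a deliberate fixture from a real dead reference, which is why these hits are triaged by hand as false positives.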
## Prior-run diff (vs 2026-06-05)
**Resolved (7):** O1 VERIFY.md row · O2 new-role VERIFY step · O4 askari group naming ·
O5 backend.tf relabel · O6 ADR-014 reproducibility · O11 CAPABILITIES Level-4 row ·
O12 TODO 3.10. **Partial:** O3 (docs tree fixed in AF4; ADR-list carried as O10).
**Not re-detected (verify next run):** O7–O10 (ADR-011 still Proposed).
## Follow-up prompt (copy-paste)
> Act on the open findings from `docs/reviews/2026-06-11-review.md`. Priority order:
> 1. **O1 (high):** `make lint` is red on `main`: `playbooks/site.yml` imports the
> non-existent `docker_host` role. Pick an interim posture (guard/skip the play, or
> `make new-role NAME=docker_host` to scaffold a stub, or exclude from syntax-check
> until built) so the trunk lints clean again, and record the choice in STATUS.md.
> 2. **O2 (high):** Resolve the ADR-004 ↔ ADR-022 backup-scope contradiction —
> update ADR-004's "not in scope of this repo" line to defer to ADR-022 (per ADR-023's
> no-silent-reversal rule) and cross-link.
> 3. **O3:** Add ACCESS.md + BACKUP.md rows to ADR-004's service-role file table.
> 4. **O4:** Reconcile CAPABILITIES' nvim/tmux exclusion with the built `dev_env` role
> (carve out the ubongo control-node exception).
> 5. **O8 (conformance):** Tag the `dev_env__home` preflight `set_fact` so partial
> `--tags users|config` runs don't fail.
> 6. **O6 / O9:** Regenerate the inventory stubs to include `offsite_hosts`; reconcile
> ubongo's 10.20.10.151 against ADR-007's subnets (or amend ADR-007).
> 7. Sweep the low-severity doc items (O5 caveat, O7 runbook, O10 ADR list, O11 ADR
> section order, O12–O18) as a single docs-hygiene batch.
> Run `make lint` before committing; commit per CLAUDE.md git conventions.

@ -0,0 +1,76 @@
{
"date": "2026-06-14",
"reviewed_commit": "e346137",
"fixes_commit": null,
"mode": "on-demand",
"counts": {
"auto_fixed": 11,
"open": 29,
"scan": {
"broken-adr-ref": 4,
"broken-path-ref": 2,
"marker": 14,
"open-deferred-item": 5,
"stale-deferred": 0
}
},
"deferral_checklist": {
"adr-011-open-items": "all 5 ('Open questions': Proxmox snapshot driver, exact cadences, health-check harness home, classification home, staging-first) confirmed genuinely still open. ADR-011 is still Proposed/unbuilt; the same questions are echoed open in docs/TODO.md item 16; no later ADR or STATUS decides any of them. No stale-deferred.",
"stale_deferred_found": 0
},
"scan_false_positives": [
{"check": "broken-adr-ref", "location": "tests/test_repo_scan.py:10,43; docs/superpowers/plans/2026-06-10-adr-structure.md:50,83", "why": "ADR-099/ADR-100 are intentional test fixtures exercising the scanner's bad-ref detection."},
{"check": "broken-path-ref", "location": "docs/superpowers/plans/2026-06-14-m4b-netbird.md:28,56", "why": "roles/netbird/ is referenced by the M4b implementation plan for a role to be scaffolded via make new-role; forward-looking plan for unbuilt work, not a dead ref."},
{"check": "marker", "location": "docs/decisions/019-tagging.md:14 + docs/superpowers/plans/* + docs/superpowers/specs/*", "why": "019-tagging.md:14 is prose discussing 'over-tagging' as a concept ('the TODO explicitly warns against...'), not an actionable TODO. The 13 superpowers markers are historical planning artifacts (commit-message TODOs, plan steps)."}
],
"auto_fixed": [
{"id": "AF1", "dimension": "drift", "severity": "high", "location": "roles/reverse_proxy/meta/main.yml:4-6", "description": "meta description said 'ACME DNS-01 TLS via Gandi ... builds the custom image on-host (caddy-dns/gandi)' — but the role is now vanilla Caddy + HTTP-01 (commit b7e919d dropped the custom image); README/defaults/compose/STATUS all reflect vanilla. Only meta was stale and contradicted the code.", "fix": "rewrote description to 'Vanilla Caddy reverse proxy (ADR-024); TLS via ACME HTTP-01 for public hosts. Routes from reverse_proxy__routes, managed via Docker Compose.'", "tag": "new"},
{"id": "AF2", "dimension": "cruft", "severity": "medium", "location": "roles/README.md:11-15", "description": "Current-state paragraph said base hardening (SSH/fail2ban), auditd, packages, users 'not yet built' and docker_host 'scaffolded but has no tasks yet' — but STATUS records the hardening concern built+tested+applied to askari, and docker_host/reverse_proxy/public_dns all built.", "fix": "rewrote to: base firewall+hardening built (hardening applied to askari), docker_host/reverse_proxy/public_dns/dev_env built; auditd/packages/users pending.", "tag": "recurring"},
{"id": "AF3", "dimension": "drift", "severity": "medium", "location": "playbooks/README.md:6-13", "description": "site.yml note said docker_host 'scaffolded with no tasks yet' (now installs Docker engine) and the file omitted dns.yml and offsite.yml entirely.", "fix": "reworded site.yml note (base firewall+hardening, no cluster docker hosts yet) and added dns.yml + offsite.yml bullets.", "tag": "new"},
{"id": "AF4", "dimension": "cruft", "severity": "low", "location": "roles/public_dns/README.md:7-9", "description": "'the anti-spoof baseline now; askari in M4' — M4a is done; askari + *.askari records are applied.", "fix": "updated to note askari.wingu.me + *.askari wildcard applied in M4a.", "tag": "new"},
{"id": "AF5", "dimension": "cruft", "severity": "low", "location": "scripts/README.md:17", "description": "Helper-script list omitted check-tags.py, which exists and is run by make lint (ADR-019).", "fix": "added a check-tags.py bullet.", "tag": "new"},
{"id": "AF6", "dimension": "drift", "severity": "medium", "location": "terraform/README.md:7-15", "description": "Top-level terraform README omitted modules/hetzner_vm and environments/offsite — the only built+applied TF environment (askari).", "fix": "added hetzner_vm + offsite env bullets; scoped 'not yet init'ed' to the Proxmox envs.", "tag": "new"},
{"id": "AF7", "dimension": "cruft", "severity": "low", "location": "terraform/environments/offsite/providers.tf:1", "description": "Verified-stamp said 'cax11@hel1' but the deployed server is cx23 (CAX11 out of stock).", "fix": "stamp now reads cx23@hel1.", "tag": "new"},
{"id": "AF8", "dimension": "cruft", "severity": "low", "location": "terraform/modules/hetzner_vm/variables.tf:7", "description": "server_type description example was 'e.g. cax11 (ARM)'; the only consumer uses cx23.", "fix": "example now 'e.g. cx23 (x86) or cax11 (ARM)'.", "tag": "new"},
{"id": "AF9", "dimension": "drift", "severity": "medium", "location": "inventories/production/group_vars/all/public_dns.yml:16-17", "description": "Comment on the *.askari wildcard said 'Caddy gets a *.askari.wingu.me cert via DNS-01 (M4a)' — M4a uses HTTP-01 (the wildcard A record itself is still legitimately needed for name resolution).", "fix": "comment now says per-host certs via ACME HTTP-01 (M4a).", "tag": "new"},
{"id": "AF10", "dimension": "drift", "severity": "high", "location": "docs/CAPABILITIES.md:27,29", "description": "Capability table named Traefik as the reverse-proxy candidate (ADR-024 chose Caddy, built+applied) and marked public DNS 'apply pending' (applied 2026-06-14).", "fix": "reverse-proxy row -> 'Caddy (ADR-024)'; public DNS note -> 'applied (M1)'. (The V4-history Traefik mention at line 134 is correct and left as-is.)", "tag": "new"},
{"id": "AF11", "dimension": "cruft", "severity": "low", "location": "README.md:110-119", "description": "README 'Documentation' ADR list stopped at ADR-017; ADR-018..024 exist.", "fix": "extended the list through ADR-024 (logging, tagging, firewall, access, backup, ADR-structure, reverse-proxy).", "tag": "recurring"}
],
"open": [
{"id": "O1", "dimension": "drift", "severity": "high", "location": "STATUS.md:41 (+ 45-48) ↔ STATUS.md:33-34", "description": "The 'Scaffolded but empty — NOT implemented' table still lists roles/docker_host as 'Scaffolded, no tasks ... applying it is a no-op', and the trailing prose (45-48) repeats it. This contradicts STATUS.md:33-34 ('Built + applied', installs Docker CE + compose) and the actual roles/docker_host/tasks/main.yml. An internal STATUS contradiction; one side is plainly correct (docker_host is built).", "suggested_fix": "Remove/rewrite the docker_host row in the 'Scaffolded but empty' table and the 45-48 paragraph: docker_host now installs the Docker engine; only its deferred daemon-hardening + nftables.d scope (ADR-004/020) remains. Report (STATUS is the operator's ground-truth doc — reword deliberately).", "tag": "new", "auto_fixable": false},
{"id": "O2", "dimension": "consistency", "severity": "high", "location": "docs/decisions/004-docker-model.md:105,131 ↔ docs/decisions/022-backup.md", "description": "ADR-004 states twice that 'Backup strategy is defined separately (not in scope of this repo)'. ADR-022 defines a full in-repo backup/DR doctrine (restic, fisi pull node, per-service backup__* + BACKUP.md). Direct ADR↔ADR scope contradiction.", "suggested_fix": "Reword ADR-004's lines to point at ADR-022 (backup is now in-repo scope) and cross-link, per ADR-023's no-silent-reversal rule. Design decision — report.", "tag": "recurring", "auto_fixable": false},
{"id": "O3", "dimension": "consistency", "severity": "high", "location": "docs/decisions/024-reverse-proxy.md (Consequences) ↔ 008-testing.md:70; 017-service-ui-verification.md:27,88; 019-tagging.md:52", "description": "ADR-024's Consequences claim 'ADR-017 prose that mentioned Traefik is updated to read Caddy'. That update was NOT done: ADR-017:27,88 still say 'Traefik + Authentik'; ADR-008:70 'Traefik + Authentik SSO flow'; ADR-019:52 'Traefik routes, Authentik'. The doc set still designs around Traefik while ADR-024 overclaims the reconciliation was completed.", "suggested_fix": "Replace Traefik with Caddy (ADR-024) in ADR-008:70, ADR-017:27,88, ADR-019:52, OR soften ADR-024's Consequences to 'to be updated'. ADR prose = design docs — report (not auto-fixed).", "tag": "new", "auto_fixable": false},
{"id": "O4", "dimension": "conformance", "severity": "high", "location": "docs/decisions/023-adr-structure.md:7-8,77-80 ↔ 016-mesh-vpn.md:3; 017-service-ui-verification.md:3; 018-logging.md:3", "description": "ADR-023 §2 mandates ## Status as the first section and §6 explicitly claims ADRs 001–018 were retroactively restructured to lead with Status (calling out 016–018). But ADR-016/017/018 still open with ## Context, Status buried late (016:~92, 017:~66, 018:~73). ADR-023's own conformance claim is contradicted by three in-scope files. (Older ADRs 001–010 lead with Status but place Decision/Consequences after topical sections — an accepted presentational trade-off per ADR-023 §5/§6.)", "suggested_fix": "Either add a top-of-file ## Status section to ADR-016/017/018 (move the existing build-state line up), or correct ADR-023 §6 to exclude them. Reordering judgement — report.", "tag": "recurring", "auto_fixable": false},
{"id": "O5", "dimension": "consistency", "severity": "medium", "location": "docs/decisions/004-docker-model.md:48-50", "description": "The service-role file table (the canonical standard) lists only README/SECURITY/VERIFY; it omits ACCESS.md (ADR-021) and BACKUP.md (ADR-022), both of which CLAUDE.md + those ADRs mandate as required per-service-role files.", "suggested_fix": "Add ACCESS.md (ADR-021) and BACKUP.md (ADR-022, stateful) rows to ADR-004's file table.", "tag": "recurring", "auto_fixable": false},
{"id": "O6", "dimension": "drift", "severity": "medium", "location": "docs/decisions/002-security.md:82", "description": "References 'make deploy PLAYBOOK=upgrade' as the deliberate full-upgrade mechanism, but no upgrade.yml exists (only bootstrap/dns/offsite/site/workstation) and ADR-011 is still Proposed/unbuilt — stated without the '(planned)' caveat ADR-002 uses for its other unbuilt controls.", "suggested_fix": "Add a '(planned — ADR-011, not yet built)' caveat to the upgrade line, or drop the concrete command until upgrade.yml exists.", "tag": "recurring", "auto_fixable": false},
{"id": "O7", "dimension": "drift", "severity": "medium", "location": "docs/CAPABILITIES.md:150-155 ↔ STATUS.md:29", "description": "CAPABILITIES still lists nvim/kitty/tmux among 'Confirmed exclusions' boma 'deliberately does not' have, but the dev_env role (built+applied to ubongo) installs neovim + tmux. (The reverse-proxy/public-DNS rows in this file were auto-fixed in AF10; this exclusions block was left because it needs a scoped carve-out, not a token swap.)", "suggested_fix": "Scope the exclusion to managed cluster/server hosts and note the control/dev host (ubongo, ADR-015) runs an interactive dev_env, or drop nvim/tmux from the list.", "tag": "recurring", "auto_fixable": false},
{"id": "O8", "dimension": "conformance", "severity": "medium", "location": "roles/dev_env/tasks/main.yml (include_tasks per_user.yml) + roles/dev_env/tasks/per_user.yml:4-9", "description": "per_user.yml's getent + set_fact dev_env__home preflight is untagged, and the include_tasks that pulls it in carries no 'apply: tags:'. base/tasks/main.yml documents and guards exactly this gotcha with apply: tags:; dev_env does not. A partial --tags users or --tags config run selects only the include statement (running nothing) or, if made tag-aware, skips the set_fact and fails the dependent [config] tasks on an undefined dev_env__home. Against ADR-019's concern-runnable-in-isolation intent.", "suggested_fix": "Add apply: tags: [users, config] to the per_user.yml include (mirroring base), and tag the getent+set_fact with 'always' (or the union [users, config]).", "tag": "recurring", "auto_fixable": false},
{"id": "O9", "dimension": "drift", "severity": "medium", "location": "inventories/production/hosts.yml:1-17", "description": "Header claims 'Generated from Terraform outputs: make tf-inventory TF_ENV=production', but the file is hand-maintained: it carries the manual control host (ubongo) and omits the offsite_hosts group that tf_to_inventory.py always emits (VALID_GROUPS). Running tf-inventory against the empty production env would DROP ubongo and ADD offsite_hosts, so the header misrepresents how the file is managed.", "suggested_fix": "Make the header honest (hand-maintained for the manual control-node exception while production TF has no VMs; offsite hosts live in offsite.yml), and reconcile the declared group set with tf_to_inventory.py. Do NOT hand-regenerate hosts.yml in a way that drops ubongo.", "tag": "recurring", "auto_fixable": false},
{"id": "O10", "dimension": "consistency", "severity": "medium", "location": "inventories/production/group_vars/all/vars.yml:42 + hosts.yml:12 ↔ docs/decisions/007-network.md", "description": "ubongo's address is 10.20.10.151 (control host_var + base__firewall_control_addr), but ADR-007 defines srv as 10.20.0.0/24 (network__srv_subnet) and mgmt as 10.10.0.0/24 — 10.20.10.151 is in neither, and ADR-007's addressing tables don't record where the physical control node lives. base__firewall_control_addr (ADR-021 recovery path) depends on this being right.", "suggested_fix": "Add ubongo to ADR-007's addressing table (which VLAN/segment 10.20.10.151 belongs to, clearly outside srv 10.20.0.0/24), or correct the address. Confirm the real address with the operator first.", "tag": "recurring", "auto_fixable": false},
{"id": "O11", "dimension": "consistency", "severity": "medium", "location": "terraform/environments/{staging,production}/terraform.tfvars.example:9-11 + variables.tf:5", "description": "Proxmox node naming uses 'pve01' (two-digit) in both tfvars.example files and the proxmox_endpoint var descriptions; ADR-007 defines single-digit node names pve0/pve1/pve2, and internal FQDNs as <host>.boma.<domain>. Example contradicts the naming convention.", "suggested_fix": "Align example values with ADR-007 (proxmox_node = pve0; endpoint = https://pve0.boma.<domain>:8006/). Verify the intended node name with the operator before changing — report rather than auto-fix.", "tag": "recurring", "auto_fixable": false},
{"id": "O12", "dimension": "conformance", "severity": "medium", "location": "roles/reverse_proxy/ (missing SECURITY.md, VERIFY.md, ACCESS.md, BACKUP.md)", "description": "CLAUDE.md requires every service role to carry SECURITY.md (ADR-002/004), VERIFY.md (ADR-008/017), ACCESS.md (ADR-021), and a stateful BACKUP.md (ADR-022); a stateless service records backup__state: false with a reason. reverse_proxy is the first real built+applied service role (askari, M4a) but ships only README.md. (Judgement recorded: public_dns is exempt — it runs on the control node against an external DNS API, provisioning no host-resident service/port, so it is not a 'service' role in the ADR-004 sense.)", "suggested_fix": "Add the four files from docs/security|testing|access|backup/ templates. BACKUP.md can declare backup__state: false (Caddy state = re-issuable ACME certs).", "tag": "new", "auto_fixable": false},
{"id": "O13", "dimension": "consistency", "severity": "low", "location": "docs/decisions/012-hardware-capacity.md; 013-heritage-v4.md:77; 015-control-host.md; 016-mesh-vpn.md; 017-service-ui-verification.md; 018-logging.md", "description": "Inconsistent cross-reference convention: ADRs 014/019/020/021/022/023 + adr-template use a dedicated '## Related' section, while 012/013/015/016/017/018 use an inline 'See also:' prose line (placed mid-document in 016/017/018). ADR-023 §3 names ## Related as the optional section; 'See also:' is an undocumented variant.", "suggested_fix": "Convert the 'See also:' prose into ## Related sections (after Consequences) in ADR-012/013/015/016/017/018 for uniformity. Cosmetic.", "tag": "recurring", "auto_fixable": false},
{"id": "O14", "dimension": "consistency", "severity": "low", "location": "docs/README.md:4-8; inventories/README.md", "description": "docs/README.md lists only decisions/ + runbooks/ (omits security/testing/access/backup/hardware/reviews); inventories/README.md omits the offsite_hosts group documented in CLAUDE.md. Both narrower than current reality.", "suggested_fix": "Add the missing subdir rows / note offsite_hosts, or explicitly defer to the canonical list in the repo README / CLAUDE.md.", "tag": "recurring", "auto_fixable": false},
{"id": "O15", "dimension": "drift", "severity": "medium", "location": "docs/runbooks/new-host.md:82,114-138 (Part E)", "description": "Part E (control node ubongo) still instructs 'ssh ansible@<IP>' / an ansible-user flow, but STATUS records ubongo is deliberately managed as the operator account sjat (group_vars/control ansible_user: sjat) with the ansible-user bootstrap listed as Pending.", "suggested_fix": "Update Part E to reflect ubongo managed as sjat (no ansible user yet), the ansible-user bootstrap a pending item per STATUS.md.", "tag": "recurring", "auto_fixable": false},
{"id": "O16", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/files/dotfiles/zsh/.zshrc:28,55", "description": "Shipped .zshrc hard-codes alias rclone=\"/usr/bin/rclone\" (rclone not installed by dev_env) and 'eval \"$(direnv hook zsh)\"' unguarded (unlike the guarded oh-my-posh block) — heritage fisi/V4 carryovers. If direnv is dropped from dev_env__packages, every shell startup errors.", "suggested_fix": "Drop the rclone alias and guard the direnv hook with 'command -v direnv', or document direnv as a hard dependency of the shipped .zshrc.", "tag": "recurring", "auto_fixable": false},
{"id": "O17", "dimension": "consistency", "severity": "low", "location": "roles/dev_env/tasks/oh_my_posh.yml:15-26", "description": "The zen.toml theme-directory + deploy tasks render config to disk but carry no 'config' tag, while analogous dotfile tasks in per_user.yml are tagged config — inconsistent concern tagging within the role.", "suggested_fix": "Add tags: [config] to the zen.toml directory + deploy tasks.", "tag": "recurring", "auto_fixable": false},
{"id": "O18", "dimension": "drift", "severity": "medium", "location": "docs/decisions/007-network.md:159,167,186 + 009-provisioning-handoff.md:114 + 016-mesh-vpn.md:90 ↔ 007-network.md:174,184", "description": "Internal-zone name is inconsistent across the doc set: ADR-007:159/167/186, ADR-009:114, ADR-016:90 call it 'boma.baobab.band', while ADR-007:174/184 says infra is '<host>.boma.wingu.me' and the internal zone 'will be renamed to boma.wingu.me' (Phase 2). M1 moved boma's home to wingu.me. A reader can't tell which domain the unbuilt dns role should render.", "suggested_fix": "State the transitional state in one authoritative place (current = boma.baobab.band, target = boma.wingu.me in Phase 2), or align all references on the target. Report.", "tag": "new", "auto_fixable": false},
{"id": "O19", "dimension": "consistency", "severity": "low", "location": "docs/decisions/009-provisioning-handoff.md:122", "description": "M1 retired 'nyumbani' as a naming tier (ROADMAP:70, ADR-007:176). ADR-009:122 still uses 'forgejo.nyumbani.baobab.band' as the worked example of internal-zone data the dns role would render. (Note: STATUS:19 + ADR-003/008/010 use the same name for the LIVE legacy Forgejo host, which is legitimately legacy infra — distinguish.)", "suggested_fix": "Update the ADR-009:122 example to a non-nyumbani name consistent with the retired-nyumbani decision; annotate the legacy Forgejo references as intentionally legacy where they remain.", "tag": "recurring", "auto_fixable": false},
{"id": "O20", "dimension": "drift", "severity": "low", "location": "docs/ROADMAP.md:82-83", "description": "ROADMAP M2 still describes askari as 'CAX11 ARM / Helsinki', but STATUS records it provisioned as cx23/x86 (CAX11/ARM out of stock EU-wide on 2026-06-14). M3/M4 sections got DONE notes; M2's spec line wasn't corrected.", "suggested_fix": "Update ROADMAP M2 to note askari shipped as cx23/x86 (CAX11 unavailable), or add a DONE note mirroring M3/M4.", "tag": "new", "auto_fixable": false},
{"id": "O21", "dimension": "drift", "severity": "low", "location": "docs/decisions/020-firewall.md:91-93", "description": "ADR-020 says askari's Hetzner Cloud Firewall 'NetBird ports (UDP 3478 + TCP 80/443) will be added in M4 when the coordinator role is built' — but M4a is DONE and the firewall already opens 80/443/3478. Future-tense is stale; only the netbird role (M4b) remains.", "suggested_fix": "Update ADR-020 to past tense (80/443/3478 opened in M4a); keep the netbird coordinator role (M4b) caveated as unbuilt.", "tag": "new", "auto_fixable": false},
{"id": "O22", "dimension": "consistency", "severity": "low", "location": "docs/decisions/024-reverse-proxy.md:60-92", "description": "ADR-024 is internally inconsistent post-revision: the revised Status note says askari ships HTTP-01 with vanilla Caddy (custom-image DNS-01 deferred to Phase 2), but Decision §2 still asserts boma builds/maintains the custom xcaddy+gandi image, §3 says 'fronts the NetBird stack on askari (M4)' (M4b unbuilt), and Consequences still lists 'a custom Caddy image must be built/pushed/kept current' as a present obligation.", "suggested_fix": "Scope the custom-image obligation (§2, Consequences) to the deferred Phase-2 DNS-01 path; soften §3 to reflect that M4a ships a test vhost and the NetBird front-end is M4b. Report (touches decision substance).", "tag": "new", "auto_fixable": false},
{"id": "O23", "dimension": "consistency", "severity": "low", "location": "docs/decisions/001-architecture.md:50 + 016-mesh-vpn.md:87 ↔ docs/ROADMAP.md:116", "description": "The future NetBird service role is named 'netbird_coordinator' in ADR-001:50 + ADR-016:87 (coordinator framing also in STATUS), but ROADMAP M4b:116 calls it 'the netbird service role'. make new-role creates one directory name; the committed names will mismatch the actual role at build time. (The M4b plan at docs/superpowers/plans/2026-06-14-m4b-netbird.md also uses 'netbird'.)", "suggested_fix": "Settle one role name and align ADR-001/016, ROADMAP, and the M4b plan before scaffolding.", "tag": "new", "auto_fixable": false},
{"id": "O24", "dimension": "consistency", "severity": "low", "location": "docs/decisions/024-reverse-proxy.md:22 ↔ docs/ROADMAP.md:71", "description": "ADR-024 describes the M1 ACME DNS-01 wildcard as '*.boma.<domain>' (infra subdomain), while ROADMAP:71 specifies '*.<boma-domain>' (apex). Different name spaces — the cert's actual SAN coverage for unexposed services is ambiguous across the two docs.", "suggested_fix": "Align the wildcard scope (decide *.wingu.me vs *.boma.wingu.me vs both) and state it identically in ADR-024 and ROADMAP.", "tag": "new", "auto_fixable": false},
{"id": "O25", "dimension": "consistency", "severity": "low", "location": "roles/reverse_proxy/molecule/default/verify.yml:11,22; roles/public_dns/molecule/default/verify.yml:12", "description": "Molecule verify tasks use tags: [verify], which is not in the tests/tags.yml vocabulary (concerns/special/opt_ins/playbooks). check-tags.py exempts molecule/ paths so the linter doesn't flag it, and 4 roles use this de-facto convention — but it's an out-of-vocabulary tag the ADR-019 standard doesn't sanction.", "suggested_fix": "Either drop the tags from molecule verify tasks (the linter ignores molecule anyway) or add 'verify' as a sanctioned testing-only tag in tests/tags.yml with an ADR-019 note. Repo-wide convention call.", "tag": "new", "auto_fixable": false},
{"id": "O26", "dimension": "consistency", "severity": "low", "location": "roles/reverse_proxy/templates/Caddyfile.j2:1; docker-compose.yml.j2:1", "description": "Neither rendered template carries an {{ ansible_managed }} header, though ADR-024 §1.2 cites 'one ansible_managed header' as a Caddy advantage. (No template in the repo currently uses ansible_managed — consistent with current practice but inconsistent with the ADR's stated intent.)", "suggested_fix": "Add a commented '# {{ ansible_managed }}' header to both templates (and ideally adopt the convention repo-wide).", "tag": "new", "auto_fixable": false},
{"id": "O27", "dimension": "consistency", "severity": "low", "location": "inventories/production/group_vars/all/reverse_proxy.yml", "description": "reverse_proxy production vars live in group_vars/all/ (every host) though the role only runs on offsite_hosts via offsite.yml; CLAUDE.md establishes an offsite_hosts/ group_vars dir for askari-specific config, which doesn't exist on disk. Harmless today (only askari imports the role) but broader scope than intended.", "suggested_fix": "Consider moving reverse_proxy.yml (and the offsite firewall opens) to group_vars/offsite_hosts/ for scope clarity, or leave if intentionally global. Judgement call.", "tag": "new", "auto_fixable": false},
{"id": "O28", "dimension": "drift", "severity": "low", "location": "scripts/capacity-scan.py:133", "description": "capacity-scan.py cross-checks workload hostnames only against inventories/<env>/hosts.yml. askari lives in inventories/production/offsite.yml, not hosts.yml, so the drift cross-check never sees it. Minor (capacity is intent-based today) but a latent gap as offsite hosts grow.", "suggested_fix": "Also read offsite.yml (or glob inventories/<env>/*.yml host files) so offsite_hosts are included.", "tag": "new", "auto_fixable": false},
{"id": "O29", "dimension": "consistency", "severity": "low", "location": "inventories/production/offsite.yml:1-16 ↔ inventories/production/hosts.yml:7-16", "description": "offsite.yml (generated by tf-inventory-offsite) re-declares control/docker_hosts/proxmox_hosts with empty host maps because tf_to_inventory.py always emits all four VALID_GROUPS — duplicating groups in hosts.yml in the same inventory dir. Ansible merges them harmlessly, but the duplication/merge is undocumented.", "suggested_fix": "Document in inventories/README.md that offsite.yml is a second generated inventory file merged with hosts.yml, or have tf_to_inventory.py emit only non-empty groups for offsite. Leave as-is if intended; just document.", "tag": "new", "auto_fixable": false}
],
"prior_resolved": [
{"id": "O1@2026-06-11", "description": "make lint RED on main (site.yml imported nonexistent docker_host role)", "status": "resolved — docker_host scaffolded (03d33f8) then built (456c27d); make lint green this run."},
{"id": "O10@2026-06-11", "description": "README ADR list stopped early (recurring)", "status": "resolved — auto-fixed this run (AF11), extended through ADR-024."},
{"id": "O17@2026-06-11", "description": "empty handlers/main.yml scaffold artifacts in base/dev_env", "status": "resolved (accepted) — treated as an intentional make new-role scaffold convention; not re-raised."},
{"id": "O2,O3,O4,O5,O6,O7,O8,O9,O11,O12,O13,O14,O15,O16,O18@2026-06-11", "description": "ADR-004 backup scope; ADR-004 ACCESS/BACKUP table; CAPABILITIES nvim/tmux; ADR-002 upgrade caveat; hosts.yml offsite_hosts; new-host Part E; dev_env set_fact tag; ubongo subnet; ADR section order; ADR-007 example; .zshrc rclone/direnv; oh_my_posh config tag; tfvars pve01; See-also vs Related; docs/inventories README narrowness", "status": "still open — carried forward as O2,O5,O7,O6,O9,O15,O8,O10,O4,O18/O19,O16,O17,O11,O13,O14 respectively (renumbered)."}
]
}


@ -0,0 +1,157 @@
# Repo review — 2026-06-14
- **Reviewed commit:** `e346137` (docs(plan): M4b — NetBird coordinator service role)
- **Mode:** on-demand (interactive — auto-fixes applied + committed)
- **Previous run:** 2026-06-11 (`67f2aba`)
- **`make lint`:** green before and after fixes (260 files, profile production; check-tags OK).
## Summary
A lot shipped since the last review (M4a: `docker_host` Docker engine, `reverse_proxy`
Caddy applied to askari; offsite Terraform env live; ADR-024). Most findings this run are
the predictable **docs-lagging-the-build** kind — stale "not built yet" notes, a
reverse-proxy that switched from DNS-01/custom-image to vanilla HTTP-01 leaving stale
descriptions behind, and the **Traefik→Caddy** rename only half-propagated through the
ADR set. The previous run's blocker (O1, `make lint` RED) is **resolved**.
### Counts
| Dimension | High | Medium | Low | Total |
|---|---|---|---|---|
| Cruft / staleness | 0 | 0 | 0 | 0 |
| Design conformance | 1 | 2 | 2 | 5 |
| Consistency & intent | 2 | 2 | 9 | 13 |
| Docs-vs-reality drift | 1 | 4 | 5 | 10 |
| **Open total** | **4** | **8** | **16** | **29** |
Plus **11 auto-fixes applied** (3 high, 5 medium, 3 low).
### Phase-0 scan
`repo-scan.py`: 5 roles, 25 ADRs · broken-adr-ref=4, broken-path-ref=2, marker=14,
open-deferred-item=5, **stale-deferred=0**. Every scan finding is a known false-positive
(test fixtures ADR-099/100; the `roles/netbird/` references in the M4b *plan* for unbuilt
work; superpowers planning artifacts; `019-tagging.md:14` is prose about "over-tagging",
not a TODO). Details in the findings JSON.
### Deferral checklist
All 5 ADR-011 "Open questions" (Proxmox snapshot driver, exact cadences, health-check
harness home, classification home, staging-first) confirmed **genuinely still open**:
ADR-011 is still Proposed/unbuilt, the same questions sit open in `docs/TODO.md` item 16,
and no later ADR or STATUS decides any of them. **No stale-deferred** (same as last run).
## Auto-fixes applied
All safe/obvious (stale text contradicting code/reality, partial enumerations, broken
descriptions) — no logic, variable, secret, or task-order changes.
| ID | Sev | File | What |
|---|---|---|---|
| AF1 | high | `roles/reverse_proxy/meta/main.yml` | description still said DNS-01 + custom on-host image → rewrote to vanilla Caddy + HTTP-01 (matches the role since b7e919d) |
| AF2 | med | `roles/README.md` | base hardening + docker_host/reverse_proxy/public_dns build-state was stale → reconciled with STATUS |
| AF3 | med | `playbooks/README.md` | stale "docker_host has no tasks" note; added missing `dns.yml` + `offsite.yml` bullets |
| AF4 | low | `roles/public_dns/README.md` | "askari in M4" → askari + `*.askari` records applied in M4a |
| AF5 | low | `scripts/README.md` | added the missing `check-tags.py` entry (run by `make lint`) |
| AF6 | med | `terraform/README.md` | added `modules/hetzner_vm` + `environments/offsite` (the one applied env) |
| AF7 | low | `terraform/environments/offsite/providers.tf` | verified-stamp `cax11@hel1` → `cx23@hel1` (actual server) |
| AF8 | low | `terraform/modules/hetzner_vm/variables.tf` | `server_type` example `cax11 (ARM)` → `cx23 (x86) or cax11 (ARM)` |
| AF9 | med | `inventories/production/group_vars/all/public_dns.yml` | wildcard comment "cert via DNS-01" → ACME HTTP-01 (M4a) |
| AF10 | high | `docs/CAPABILITIES.md` | reverse-proxy candidate `Traefik` → `Caddy (ADR-024)`; public DNS "apply pending" → "applied (M1)" |
| AF11 | low | `README.md` | Documentation ADR list extended ADR-017 → ADR-024 |
## Open findings (prioritised)
### High
- **O1 — drift — STATUS.md:41 (+45-48) ↔ 33-34** *(new)*: docker_host still appears in
the "Scaffolded but empty — NOT implemented" table as a no-op, contradicting its own
"Built + applied" rows and the real tasks file. Reword the scaffold row + closing
paragraph (left for the operator — STATUS is the ground-truth doc).
- **O2 — consistency — ADR-004:105,131 ↔ ADR-022** *(recurring)*: ADR-004 says backup is
"not in scope of this repo"; ADR-022 defines a full in-repo backup doctrine. Repoint
ADR-004 at ADR-022 (ADR↔ADR design decision — report).
- **O3 — consistency — ADR-024 Consequences ↔ ADR-008:70/017:27,88/019:52** *(new)*:
ADR-024 claims it updated ADR-017's Traefik prose to Caddy; it didn't, and ADR-008/019
still say Traefik too. Either finish the rename or soften ADR-024's claim.
- **O4 — conformance — ADR-023:7-8,77-80 ↔ ADR-016/017/018** *(recurring)*: ADR-023
claims ADRs 001–018 were restructured to lead with `## Status`, but 016/017/018 still
open with `## Context` and bury Status. Fix the three ADRs or correct ADR-023 §6.
### Medium
- **O5 — ADR-004:48-50** *(recurring)*: service-role file table omits ACCESS.md +
BACKUP.md rows (now mandated by CLAUDE.md/ADR-021/022).
- **O6 — ADR-002:82** *(recurring)*: `make deploy PLAYBOOK=upgrade` cited as real, but no
`upgrade.yml` exists and ADR-011 is unbuilt — needs a `(planned)` caveat.
- **O7 — CAPABILITIES:150-155 ↔ STATUS:29** *(recurring)*: nvim/tmux listed as a
"confirmed exclusion" while `dev_env` installs them on ubongo; needs a control-host
carve-out (not a token swap, so left out of AF10).
- **O8 — dev_env tasks (include_tasks + per_user.yml:4-9)** *(recurring)*: untagged
`set_fact dev_env__home` preflight + include without `apply: tags:`; a partial
`--tags users|config` run breaks (base guards this; dev_env doesn't).
- **O9 — inventories/production/hosts.yml** *(recurring)*: header claims TF-generated but
it's hand-maintained (carries ubongo, omits offsite_hosts); `tf-inventory` would drop
ubongo. Make the header honest.
- **O10 — group_vars/all/vars.yml:42 ↔ ADR-007** *(recurring)*: ubongo `10.20.10.151` is
in no ADR-007 subnet and undocumented; `base__firewall_control_addr` depends on it.
- **O11 — terraform tfvars.example (both envs)** *(recurring)*: `pve01` vs ADR-007's
`pve0`; verify the real node name before changing.
- **O12 — roles/reverse_proxy/** *(new)*: first built+applied service role, but missing
SECURITY/VERIFY/ACCESS/BACKUP.md. (Recorded judgement: public_dns is exempt — control-
node external-API role, not a host service.)
- **O15 — runbooks/new-host.md Part E** *(recurring)*: still describes an `ansible` user
on ubongo; STATUS says ubongo is managed as `sjat` (ansible-user bootstrap pending).
- **O18 — ADR-007/009/016 internal-zone name** *(new)*: `boma.baobab.band` vs target
`boma.wingu.me` used inconsistently across the doc set after M1; state the transition
in one place.
### Low
O13 (See-also vs `## Related` in ADR-012/013/015/016/017/018 — recurring), O14
(docs/README + inventories/README narrow enumerations — recurring), O16 (.zshrc rclone
alias + unguarded direnv hook — recurring), O17 (oh_my_posh zen.toml tasks missing
`config` tag — recurring), O19 (ADR-009:122 `nyumbani` example after retirement —
recurring), O20 (ROADMAP M2 CAX11/ARM vs cx23/x86 — new), O21 (ADR-020 "ports will be
added in M4" stale; already opened in M4a — new), O22 (ADR-024 body still asserts custom-
image obligation contradicting its revised Status — new), O23 (`netbird_coordinator` vs
`netbird` role name across ADRs/ROADMAP/plan — new), O24 (`*.boma.<domain>` vs
`*.<boma-domain>` wildcard scope ADR-024 vs ROADMAP — new), O25 (`tags: [verify]` out of
the ADR-019 vocabulary in molecule verify — new), O26 (reverse_proxy templates lack
`ansible_managed` header — new), O27 (reverse_proxy vars in `group_vars/all/` not
`offsite_hosts/` — new), O28 (capacity-scan.py ignores `offsite.yml` — new), O29
(offsite.yml duplicates empty groups from hosts.yml, undocumented merge — new).
Full detail + suggested fixes in `2026-06-14-findings.json`.
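For O28, one possible shape of the fix is sketched below in Python. It assumes the
inventory files are already parsed into dictionaries (for example by a YAML loader) and
follow the usual Ansible YAML layout of groups carrying `hosts` and `children`. The
function names and the sample data are hypothetical and are not taken from
`capacity-scan.py`.

```python
from typing import Any, Iterable, Optional


def hosts_in_group(group: Optional[dict]) -> set:
    """Collect host names from one inventory group, descending into children."""
    if not group:
        # An empty group body (e.g. `control:` with nothing under it) parses to None.
        return set()
    found = set((group.get("hosts") or {}).keys())
    for child in (group.get("children") or {}).values():
        found |= hosts_in_group(child)
    return found


def known_hosts(inventories: Iterable[dict]) -> set:
    """Union of host names across several parsed inventory files."""
    found: set = set()
    for inventory in inventories:
        for group in (inventory or {}).values():
            found |= hosts_in_group(group)
    return found


# Hypothetical stand-ins for the parsed hosts.yml and offsite.yml.
hosts_yml: dict = {"all": {"children": {"control": {"hosts": {"ubongo": {}}}}}}
offsite_yml: dict = {
    "all": {
        "children": {
            "control": {"hosts": {}},  # empty group re-declared by the generator
            "offsite_hosts": {"hosts": {"askari": {}}},
        }
    }
}

print(sorted(known_hosts([hosts_yml, offsite_yml])))  # ['askari', 'ubongo']
```

In the script itself the list of inventories would presumably come from globbing the
host files under `inventories/<env>/` rather than from literals; merging by set union
also makes the empty re-declared groups noted in O29 harmless to the cross-check.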
## Themes worth a deliberate pass
1. **Finish the Traefik→Caddy rename** (O3, and ADR-024 over-claimed it was done). One
sweep across ADR-008/017/019 closes it.
2. **STATUS docker_host self-contradiction** (O1) — quick, but it's the ground-truth doc.
3. **ADR-024 internal consistency** (O22) — the role went vanilla/HTTP-01 but the ADR
body still mandates the custom image; reconcile §2/§3/Consequences with its own Status.
4. **dev_env tag-isolation** (O8) — the one real conformance bug with runtime impact;
mirror base's `apply: tags:` guard.
5. **First service-role doc quartet** (O12) — reverse_proxy is the template for every
future service role; getting SECURITY/VERIFY/ACCESS/BACKUP.md right now pays forward.
## Follow-up prompt
> Work the open findings from `docs/reviews/2026-06-14-review.md`. Priority order:
> (1) **O1** — fix the STATUS.md docker_host contradiction (it's built+applied, not a
> no-op; reword the "Scaffolded but empty" row + the 45-48 paragraph).
> (2) **O3 + O22** — finish the Traefik→Caddy rename in ADR-008:70, ADR-017:27,88,
> ADR-019:52, and reconcile ADR-024's body (§2 custom image, §3 NetBird, Consequences)
> with its own revised HTTP-01 Status note.
> (3) **O2 + O5** — repoint ADR-004's "backup not in scope" line at ADR-022 and add
> ACCESS.md + BACKUP.md rows to its service-role file table.
> (4) **O8** — add `apply: tags: [users, config]` to dev_env's per_user.yml include and
> tag the `dev_env__home` set_fact `always`; add a Molecule assertion that a partial
> `--tags config` run still resolves the home dir.
> (5) **O12** — author the four service-role doc files for `roles/reverse_proxy/` from the
> templates (BACKUP.md = `backup__state: false`, re-issuable certs).
> (6) **O4** — restructure ADR-016/017/018 to lead with `## Status`, or correct ADR-023 §6.
> Then the medium drift items (O6 upgrade caveat, O7 nvim/tmux carve-out, O9 hosts.yml
> header, O15 new-host Part E, O18 internal-zone naming). Run `make lint` after each
> batch; commit per CLAUDE.md git conventions.


@ -1,23 +1,157 @@
# Latest repo review
# Repo review — 2026-06-14
Most recent: **2026-05-30** → full report: `docs/reviews/2026-05-30-review.md`
- **Reviewed commit:** `e346137` (docs(plan): M4b — NetBird coordinator service role)
- **Mode:** on-demand (interactive — auto-fixes applied + committed)
- **Previous run:** 2026-06-11 (`67f2aba`)
- **`make lint`:** green before and after fixes (260 files, profile production; check-tags OK).
| | high | medium | low | total |
## Summary
A lot shipped since the last review (M4a: `docker_host` Docker engine, `reverse_proxy`
Caddy applied to askari; offsite Terraform env live; ADR-024). Most findings this run are
the predictable **docs-lagging-the-build** kind — stale "not built yet" notes, a
reverse-proxy that switched from DNS-01/custom-image to vanilla HTTP-01 leaving stale
descriptions behind, and the **Traefik→Caddy** rename only half-propagated through the
ADR set. The previous run's blocker (O1, `make lint` RED) is **resolved**.
### Counts
| Dimension | High | Medium | Low | Total |
|---|---|---|---|---|
| Auto-fixed | 2 | 3 | 2 | 7 |
| Open | 4 | 4 | 9 | 17 |
| Cruft / staleness | 0 | 0 | 0 | 0 |
| Design conformance | 1 | 2 | 2 | 5 |
| Consistency & intent | 2 | 2 | 9 | 13 |
| Docs-vs-reality drift | 1 | 4 | 5 | 10 |
| **Open total** | **4** | **8** | **16** | **29** |
Dominant theme: drift from this session's own changes — residual `.vault_pass`
references after the Vaultwarden/rbw switch, and leftover PR/merge-request language
after going trunk-based.
Plus **11 auto-fixes applied** (3 high, 5 medium, 3 low).
## Suggested follow-up prompt
### Phase-0 scan
> Remediate the boma 2026-05-30 review (`docs/reviews/2026-05-30-review.md`):
> 1. Purge the residual `.vault_pass` references R1–R5 → the rbw/Vaultwarden flow.
> 2. Decide the workflow model R6–R7 — I lean "keep deploy approval gates, drop the
> PR/merge-request framing"; reconcile ADR-003/008 and CLAUDE.md to match.
> 3. Resolve R8 — scaffold `base`/`docker_host` via `make new-role`, or correct
> STATUS.md/roles/README.md to say the roles don't exist yet.
> 4. Fix the Terraform `vlan_tag` wiring (R9).
> Report on the rest.
`repo-scan.py`: 5 roles, 25 ADRs · broken-adr-ref=4, broken-path-ref=2, marker=14,
open-deferred-item=5, **stale-deferred=0**. Every scan finding is a known false-positive
(test fixtures ADR-099/100; the `roles/netbird/` references in the M4b *plan* for unbuilt
work; superpowers planning artifacts; `019-tagging.md:14` is prose about "over-tagging",
not a TODO). Details in the findings JSON.
### Deferral checklist
All 5 ADR-011 "Open questions" (Proxmox snapshot driver, exact cadences, health-check
harness home, classification home, staging-first) confirmed **genuinely still open**:
ADR-011 is still Proposed/unbuilt, the same questions sit open in `docs/TODO.md` item 16,
and no later ADR or STATUS decides any of them. **No stale-deferred** (same as last run).
## Auto-fixes applied
All safe/obvious (stale text contradicting code/reality, partial enumerations, broken
descriptions) — no logic, variable, secret, or task-order changes.
| ID | Sev | File | What |
|---|---|---|---|
| AF1 | high | `roles/reverse_proxy/meta/main.yml` | description still said DNS-01 + custom on-host image → rewrote to vanilla Caddy + HTTP-01 (matches the role since b7e919d) |
| AF2 | med | `roles/README.md` | base hardening + docker_host/reverse_proxy/public_dns build-state was stale → reconciled with STATUS |
| AF3 | med | `playbooks/README.md` | stale "docker_host has no tasks" note; added missing `dns.yml` + `offsite.yml` bullets |
| AF4 | low | `roles/public_dns/README.md` | "askari in M4" → askari + `*.askari` records applied in M4a |
| AF5 | low | `scripts/README.md` | added the missing `check-tags.py` entry (run by `make lint`) |
| AF6 | med | `terraform/README.md` | added `modules/hetzner_vm` + `environments/offsite` (the one applied env) |
| AF7 | low | `terraform/environments/offsite/providers.tf` | verified-stamp `cax11@hel1` → `cx23@hel1` (actual server) |
| AF8 | low | `terraform/modules/hetzner_vm/variables.tf` | `server_type` example `cax11 (ARM)` → `cx23 (x86) or cax11 (ARM)` |
| AF9 | med | `inventories/production/group_vars/all/public_dns.yml` | wildcard comment "cert via DNS-01" → ACME HTTP-01 (M4a) |
| AF10 | high | `docs/CAPABILITIES.md` | reverse-proxy candidate `Traefik` → `Caddy (ADR-024)`; public DNS "apply pending" → "applied (M1)" |
| AF11 | low | `README.md` | Documentation ADR list extended ADR-017 → ADR-024 |
## Open findings (prioritised)
### High
- **O1 — drift — STATUS.md:41 (+45-48) ↔ 33-34** *(new)*: docker_host still appears in
the "Scaffolded but empty — NOT implemented" table as a no-op, contradicting its own
"Built + applied" rows and the real tasks file. Reword the scaffold row + closing
paragraph (left for the operator — STATUS is the ground-truth doc).
- **O2 — consistency — ADR-004:105,131 ↔ ADR-022** *(recurring)*: ADR-004 says backup is
"not in scope of this repo"; ADR-022 defines a full in-repo backup doctrine. Repoint
ADR-004 at ADR-022 (ADR↔ADR design decision — report).
- **O3 — consistency — ADR-024 Consequences ↔ ADR-008:70/017:27,88/019:52** *(new)*:
ADR-024 claims it updated ADR-017's Traefik prose to Caddy; it didn't, and ADR-008/019
still say Traefik too. Either finish the rename or soften ADR-024's claim.
- **O4 — conformance — ADR-023:7-8,77-80 ↔ ADR-016/017/018** *(recurring)*: ADR-023
claims ADRs 001–018 were restructured to lead with `## Status`, but 016/017/018 still
open with `## Context` and bury Status. Fix the three ADRs or correct ADR-023 §6.
### Medium
- **O5 — ADR-004:48-50** *(recurring)*: service-role file table omits ACCESS.md +
BACKUP.md rows (now mandated by CLAUDE.md/ADR-021/022).
- **O6 — ADR-002:82** *(recurring)*: `make deploy PLAYBOOK=upgrade` cited as real, but no
`upgrade.yml` exists and ADR-011 is unbuilt — needs a `(planned)` caveat.
- **O7 — CAPABILITIES:150-155 ↔ STATUS:29** *(recurring)*: nvim/tmux listed as a
"confirmed exclusion" while `dev_env` installs them on ubongo; needs a control-host
carve-out (not a token swap, so left out of AF10).
- **O8 — dev_env tasks (include_tasks + per_user.yml:4-9)** *(recurring)*: untagged
`set_fact dev_env__home` preflight + include without `apply: tags:`; a partial
`--tags users|config` run breaks (base guards this; dev_env doesn't).
- **O9 — inventories/production/hosts.yml** *(recurring)*: header claims TF-generated but
it's hand-maintained (carries ubongo, omits offsite_hosts); `tf-inventory` would drop
ubongo. Make the header honest.
- **O10 — group_vars/all/vars.yml:42 ↔ ADR-007** *(recurring)*: ubongo `10.20.10.151` is
in no ADR-007 subnet and undocumented; `base__firewall_control_addr` depends on it.
- **O11 — terraform tfvars.example (both envs)** *(recurring)*: `pve01` vs ADR-007's
`pve0`; verify the real node name before changing.
- **O12 — roles/reverse_proxy/** *(new)*: first built+applied service role, but missing
SECURITY/VERIFY/ACCESS/BACKUP.md. (Recorded judgement: public_dns is exempt — control-
node external-API role, not a host service.)
- **O15 — runbooks/new-host.md Part E** *(recurring)*: still describes an `ansible` user
on ubongo; STATUS says ubongo is managed as `sjat` (ansible-user bootstrap pending).
- **O18 — ADR-007/009/016 internal-zone name** *(new)*: `boma.baobab.band` vs target
`boma.wingu.me` used inconsistently across the doc set after M1; state the transition
in one place.
### Low
O13 (See-also vs `## Related` in ADR-012/013/015/016/017/018 — recurring), O14
(docs/README + inventories/README narrow enumerations — recurring), O16 (.zshrc rclone
alias + unguarded direnv hook — recurring), O17 (oh_my_posh zen.toml tasks missing
`config` tag — recurring), O19 (ADR-009:122 `nyumbani` example after retirement —
recurring), O20 (ROADMAP M2 CAX11/ARM vs cx23/x86 — new), O21 (ADR-020 "ports will be
added in M4" stale; already opened in M4a — new), O22 (ADR-024 body still asserts custom-
image obligation contradicting its revised Status — new), O23 (`netbird_coordinator` vs
`netbird` role name across ADRs/ROADMAP/plan — new), O24 (`*.boma.<domain>` vs
`*.<boma-domain>` wildcard scope ADR-024 vs ROADMAP — new), O25 (`tags: [verify]` out of
the ADR-019 vocabulary in molecule verify — new), O26 (reverse_proxy templates lack
`ansible_managed` header — new), O27 (reverse_proxy vars in `group_vars/all/` not
`offsite_hosts/` — new), O28 (capacity-scan.py ignores `offsite.yml` — new), O29
(offsite.yml duplicates empty groups from hosts.yml, undocumented merge — new).
Full detail + suggested fixes in `2026-06-14-findings.json`.
## Themes worth a deliberate pass
1. **Finish the Traefik→Caddy rename** (O3, and ADR-024 over-claimed it was done). One
sweep across ADR-008/017/019 closes it.
2. **STATUS docker_host self-contradiction** (O1) — quick, but it's the ground-truth doc.
3. **ADR-024 internal consistency** (O22) — the role went vanilla/HTTP-01 but the ADR
body still mandates the custom image; reconcile §2/§3/Consequences with its own Status.
4. **dev_env tag-isolation** (O8) — the one real conformance bug with runtime impact;
mirror base's `apply: tags:` guard.
5. **First service-role doc quartet** (O12) — reverse_proxy is the template for every
future service role; getting SECURITY/VERIFY/ACCESS/BACKUP.md right now pays forward.
## Follow-up prompt
> Work the open findings from `docs/reviews/2026-06-14-review.md`. Priority order:
> (1) **O1** — fix the STATUS.md docker_host contradiction (it's built+applied, not a
> no-op; reword the "Scaffolded but empty" row + the 45-48 paragraph).
> (2) **O3 + O22** — finish the Traefik→Caddy rename in ADR-008:70, ADR-017:27,88,
> ADR-019:52, and reconcile ADR-024's body (§2 custom image, §3 NetBird, Consequences)
> with its own revised HTTP-01 Status note.
> (3) **O2 + O5** — repoint ADR-004's "backup not in scope" line at ADR-022 and add
> ACCESS.md + BACKUP.md rows to its service-role file table.
> (4) **O8** — add `apply: tags: [users, config]` to dev_env's per_user.yml include and
> tag the `dev_env__home` set_fact `always`; add a Molecule assertion that a partial
> `--tags config` run still resolves the home dir.
> (5) **O12** — author the four service-role doc files for `roles/reverse_proxy/` from the
> templates (BACKUP.md = `backup__state: false`, re-issuable certs).
> (6) **O4** — restructure ADR-016/017/018 to lead with `## Status`, or correct ADR-023 §6.
> Then the medium drift items (O6 upgrade caveat, O7 nvim/tmux carve-out, O9 hosts.yml
> header, O15 new-host Part E, O18 internal-zone naming). Run `make lint` after each
> batch; commit per CLAUDE.md git conventions.


@ -50,6 +50,13 @@ Don't install these until their trigger lands — then add them here and to
- **The venv-activate hook** — this repo expects the Python `.venv` active for Bash
commands. If you use the user-level `~/.claude/hooks/activate-venv.sh` pattern,
replicate it; otherwise `source .venv/bin/activate` per session after `make setup`.
- **Forgejo registry login (for image pushes)**: `make caddy-image-push` /
`molecule-image-push` need the Docker daemon authenticated to
`forgejo.nyumbani.baobab.band`. Run **`make registry-login`** once per machine: it reads
`vault.forgejo.registry_token` from the vault and does `docker login --password-stdin`
(no interactive prompt, so an agent can complete a push). The token is operator-minted
(Forgejo → Settings → Applications → Generate Token, package read+write) and set via
`make edit-vault`; until then `registry-login` prints how to obtain it. (2026-06-17 kaizen.)
## 4. A note on user-level settings
@ -58,6 +65,23 @@ The dangerous-mode permission prompt (`skipDangerousModePermissionPrompt`) is a
"operator/agent error" threat, prefer leaving that prompt **on** unless you
deliberately rely on bypass mode.
## Environment gotchas
Migrated from `docs/FRICTION.md` by the 2026-06-10 kaizen review — surprises that bite
on this kind of host/toolchain:
- **Hooks (and any new `.claude/settings.json`) added mid-session don't activate until a
Claude Code restart.** The settings watcher only tracks settings files that existed at
session start; opening `/hooks` and dismissing does *not* load them. Fresh sessions
load them normally — restart after adding a hook.
- **pre-commit stashes *unstaged* changes before running hooks**, so a partial commit of
interdependent files can revert one and fail (e.g. an `ansible.cfg` change left
unstaged). Commit interdependent changes together, or stage the config change first.
- **`rbw sync` is required after adding a Vaultwarden item before `rbw get` finds it**
(the local cache is stale otherwise).
- **This shell is zsh** — unquoted `$VAR` does *not* word-split, so a variable holding a
file list is passed as a single argument. Use explicit args/arrays.
## Verifying
After setup, a quick check: the project commands (`/review-repo`, `/capacity-review`,


@ -0,0 +1,229 @@
# Runbook — Local VM integration testing
## When to use this
Run a local VM integration test before deploying any change that touches:
- **nftables / firewall rules** (the `firewall` concern of `base`)
- **sshd configuration** (listener address, port, key types, `base` hardening)
- **boot ordering or kernel parameters** (systemd units, sysctl)
- **Docker host networking** (`docker_host` DNAT rules, published-port forwarding, `daemon.json`)
These are the change classes that Molecule (ADR-008 Level 1) cannot catch: they require
a real kernel reboot to surface. This harness is the concrete tool for ADR-008 Level 2/3
(see ADR-025) and directly operationalises two standing rules:
- **"Test risky infra before live deploy"** (standing rule, ubongo memory) — firewall/sshd/boot changes must be tested on a real VM with a real reboot before touching a live host.
- **FRICTION 2026-06-17 #6 — validate reboot-recovery before retiring the break-glass** — the lesson crystallised from the mesh-hardening incident: confirm the host recovers from reboot *while you still have the break-glass open*, not after.
You do not need this runbook for pure-config changes (template rendering, package lists, user management) — Molecule covers those.
---
## First-deploy (one-time setup)
The `integration_test` role installs libvirt + QEMU + virtinst on ubongo and adds the
operator accounts (`sjat`, `claude`) to the `libvirt` and `kvm` groups.
```bash
make deploy PLAYBOOK=site LIMIT=ubongo TAGS=integration_test
```
**Re-login after this run** — group membership changes do not take effect in the current
session. The driver (`scripts/integration-vm.py`) requires both `libvirt` and `kvm`
group membership to create and manage VMs.
The golden Debian-13 genericcloud qcow2 image is downloaded lazily on the first run
(one-time cost, ~500 MB); subsequent runs reuse the cached image.
---
## Running a cycle
### Makefile interface (recommended)
```bash
# Full cycle (provision → apply → reboot → assert → teardown on pass)
make test-integration HOST=askari
# With a specific cert tier
make test-integration HOST=askari CERTS=le-staging
# Keep the VM alive after the run (for manual inspection)
make test-integration HOST=askari KEEP=1
# Destroy all orphan integration VMs (name-prefix boma-it-*)
make test-integration-clean
```
`HOST` is a hostname from the production inventory; the profile
`tests/integration/profiles/<host>.json` must exist (see Adding a new profile below).
`CERTS` defaults to `internal`.
### Lower-level driver
The driver (`scripts/integration-vm.py`) exposes individual lifecycle steps for manual
or scripted use:
| Sub-command | What it does |
|---|---|
| `up` | Ensure golden image → create ephemeral overlay → cloud-init seed → boot |
| `apply` | Run the site playbook against the transient inventory (real apply) |
| `reboot` | `virsh reboot` + wait for a verified reboot (boot-id change) — the step Molecule cannot do |
| `assert` | Run `tests/integration/verify.yml` (outcome assertions) |
| `cycle` | `up` → `apply` → `reboot` → `assert` → `down` (default: destroy on pass) |
| `down` | Destroy the VM + overlay |
| `prune` | Destroy all `boma-it-*` VMs + overlays (orphan cleanup) |
| `console` | Print the VM's captured serial-console log |
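The "verified reboot (boot-id change)" in the `reboot` row can be illustrated with a small sketch. On Linux, `/proc/sys/kernel/random/boot_id` holds a UUID that is regenerated on every boot, so a changed value suggests the guest really restarted rather than merely staying reachable. This is not the driver's actual code: `read_boot_id` is a hypothetical callable (for example, an SSH `cat` of that file), and the timeout values are assumptions.

```python
import time
from typing import Callable, Optional


def wait_for_new_boot_id(
    read_boot_id: Callable[[], Optional[str]],
    old_boot_id: str,
    timeout_s: float = 300.0,
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the guest reports a boot ID different from old_boot_id.

    read_boot_id is a hypothetical helper: it should return the contents of
    /proc/sys/kernel/random/boot_id from the guest, or None while the guest
    is unreachable (for example, mid-reboot).
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        current = read_boot_id()
        # None or an unchanged ID both mean "not rebooted yet": keep polling.
        if current and current.strip() != old_boot_id.strip():
            return current.strip()
        sleep(poll_s)
    raise TimeoutError("guest did not come back with a new boot ID")
```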
```bash
# Example: step through manually
python3 scripts/integration-vm.py up --host askari
python3 scripts/integration-vm.py apply --host askari
python3 scripts/integration-vm.py reboot --host askari
python3 scripts/integration-vm.py assert --host askari
python3 scripts/integration-vm.py down --host askari
```
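The manual steps above are what `cycle` chains together. A minimal sketch of that orchestration follows, assuming each sub-command exits non-zero on failure (an assumption, not confirmed by this runbook). `run_cycle` and `run_step` are hypothetical names, not functions from the driver.

```python
import subprocess
import sys
from typing import Callable

STEPS = ("up", "apply", "reboot", "assert")


def run_step(step: str, host: str) -> int:
    """Invoke one driver sub-command and return its exit code."""
    cmd = [sys.executable, "scripts/integration-vm.py", step, "--host", host]
    return subprocess.run(cmd).returncode


def run_cycle(host: str, keep: bool = False,
              runner: Callable[[str, str], int] = run_step) -> bool:
    """Run up -> apply -> reboot -> assert; tear down only on a pass."""
    for step in STEPS:
        if runner(step, host) != 0:
            # Stop at the first failure and leave the VM up for inspection.
            return False
    if not keep:
        runner("down", host)
    return True
```

Leaving the VM in place on failure mirrors the behaviour described under "Inspecting a kept or failed VM".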
---
## Cert tiers
| Tier | Flag | Use when |
|---|---|---|
| `internal` | `CERTS=internal` (default) | Incident repro, firewall/sshd/boot changes where certs are not under test. Zero deps, instant. |
| `le-staging` | `CERTS=le-staging` | Testing the Caddy DNS-01 ACME path, cert renewal logic, or the `caddy-gandi` plugin. Real cert files, untrusted root, effectively no rate limits. Requires `vault.gandi.pat`. |
| `le-prod-wildcard` | `CERTS=le-prod-wildcard` | Verifying TLS behaviour with a real trusted cert. On-demand only — accepted risk R6 (`docs/security/accepted-risks.md`): the production Gandi PAT reaches an ephemeral VM and transient TXT records are written into the real `wingu.me` zone. |
> A deliberate "no-egress" scenario (reproducing FRICTION 2026-06-17 #4 — the
> `netbird-server` GeoLite2 FATAL-loop when NAT masquerade is wiped) **must** use
> `CERTS=internal`: the egress loss is the fault being simulated, and ACME requires egress.
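The tier rules above can be summarised as a small pre-flight check. This is an illustrative sketch only: `check_cert_tier` and its parameters are hypothetical, and it encodes just the constraints stated in the table and note (ACME tiers need the Gandi PAT and egress).

```python
TIERS = ("internal", "le-staging", "le-prod-wildcard")


def check_cert_tier(tier: str = "internal", *, has_gandi_pat: bool = False,
                    egress: bool = True) -> list[str]:
    """Return the problems with the requested tier (empty list means OK)."""
    if tier not in TIERS:
        return [f"unknown cert tier: {tier!r}"]
    problems = []
    if tier != "internal" and not egress:
        # ACME (DNS-01) needs outbound access; no-egress repros use internal.
        problems.append("ACME requires egress; use CERTS=internal")
    if tier != "internal" and not has_gandi_pat:
        problems.append("vault.gandi.pat is required for ACME tiers")
    return problems
```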
---
## Diagnostics and inspecting a failed VM
### Where diagnostics land
Diagnostics from every run are captured in:
```
~/integration-runs/<timestamp>-<host>/
```
This directory is gitignored. On a failed assert step, the driver dumps:
- `nft list ruleset` — the live nftables state at failure
- `docker ps -a` — container states
- `ss -tlnp` — listening sockets
- `journalctl -b` — full boot log
- `systemd-analyze critical-chain` — boot timing
- Serial console capture (on boot/SSH failure — the automated equivalent of the Hetzner
console, addressing FRICTION 2026-06-17 #5)
The agent reads these directly from `~/integration-runs/` — no manual download needed.
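To locate the most recent run for a host, a helper along these lines may be handy. It is a sketch under assumptions: the exact `<timestamp>` format is not specified here, so the code picks the newest directory by modification time rather than by name, and `latest_run_dir` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(host: str, root: Optional[Path] = None) -> Optional[Path]:
    """Return the newest ~/integration-runs/<timestamp>-<host>/ directory."""
    root = root if root is not None else Path.home() / "integration-runs"
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir()
            if p.is_dir() and p.name.endswith(f"-{host}")]
    # Newest by mtime; returns None when no run exists for this host.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)
```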
### Inspecting a kept or failed VM
When a run fails or when `KEEP=1` is passed, the VM is left running. Connect to it:
```bash
# Serial console (no SSH needed — useful when SSH is the fault)
python3 scripts/integration-vm.py console --host askari
# or directly:
virsh console boma-it-askari
# Exit with Ctrl-]
# SSH (as the ansible user, IP from virsh)
virsh domifaddr boma-it-askari --source lease
ssh ansible@<IP>
# List all integration VMs
virsh list --all | grep boma-it-
```
### Cleanup
```bash
# Destroy a specific VM
python3 scripts/integration-vm.py down --host askari
# Reap all orphans
make test-integration-clean
# or:
python3 scripts/integration-vm.py prune
```
---
## Safety invariants
These make the test tool itself safe — the harness cannot reach or modify production:
1. **Single-host transient inventory** — the playbook apply runs against a generated
single-host inventory (`ansible_host=<VM lease IP>`). No real host is ever in scope.
2. **In-VM coordinator only** — "be askari" points NetBird at the coordinator running
inside the VM itself (localhost endpoint). The VM forms its own one-node mesh; it
never enrols in the real NetBird mesh.
3. **Isolated NAT network** — test VMs sit on a dedicated libvirt NAT network.
Outbound NAT provides ACME/image-pull access, but the VM is not reachable from
the LAN (`10.20.x`) or the real mesh.
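Invariant 1 can be pictured as follows. This sketch builds a single-host inventory in Ansible's YAML inventory shape and renders it as JSON (which is also valid YAML). The function names and the `ansible_user` default are assumptions for illustration; the driver's real generator may differ.

```python
import ipaddress
import json


def transient_inventory(host: str, lease_ip: str, user: str = "ansible") -> dict:
    """Build an inventory whose only host is the test VM."""
    ipaddress.ip_address(lease_ip)  # raises ValueError on a malformed address
    return {"all": {"hosts": {host: {"ansible_host": lease_ip,
                                     "ansible_user": user}}}}


def render_inventory(inventory: dict) -> str:
    """Serialise the inventory; refuse anything but exactly one host."""
    hosts = inventory["all"]["hosts"]
    if len(hosts) != 1:
        raise ValueError("transient inventory must contain exactly one host")
    return json.dumps(inventory, indent=2)
```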
---
## Resource constraints
The default VM profile is ~2 vCPU / 3 GiB RAM / 20 GiB thin-provisioned overlay. The
driver enforces **one integration VM at a time** (refusing to start if another
`boma-it-*` VM is already running) and refuses to start below the free-RAM threshold
(~13 GiB available on ubongo at baseline, per ADR-025).
**Do not run a test-integration cycle alongside a Level-4 browser session**
(Chromium/Playwright, ADR-017) — both compete for ubongo RAM. The resource guard is the
enforcement mechanism, not a suggestion.
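The resource guard might look roughly like the sketch below. It takes the text of `/proc/meminfo` and a list of running VM names as inputs so it stays testable. `min_free_gib` is deliberately a parameter: the real threshold lives in ADR-025, and `start_blockers` is a hypothetical name.

```python
def mem_available_gib(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo text (the kernel reports KiB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemAvailable not found")


def start_blockers(meminfo_text: str, running_vms: list[str],
                   min_free_gib: float, prefix: str = "boma-it-") -> list[str]:
    """Return the reasons a new integration VM must not start (empty = go)."""
    reasons = []
    others = [name for name in running_vms if name.startswith(prefix)]
    if others:
        reasons.append(f"integration VM already running: {', '.join(others)}")
    free = mem_available_gib(meminfo_text)
    if free < min_free_gib:
        reasons.append(f"only {free:.1f} GiB available, need {min_free_gib:.1f}")
    return reasons
```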
---
## Adding a new profile
To make the harness "be" a different host:
1. Create `tests/integration/profiles/<hostname>.json` — specifies which roles to apply
and base VM sizing for that host.
2. Create `tests/integration/overrides/<hostname>.yml` — the explicit stub overlay:
cert tier, in-VM coordinator endpoint (if the host runs the coordinator),
`ansible_host` placeholder, and any other variables that must differ from the real
inventory (e.g. public DNS → local resolution, geo-DB disable for coordinator).
3. Add assertions to `tests/integration/verify.yml` (or extend an existing task with a
`when: inventory_hostname == '<hostname>'` guard) for any host-specific outcomes.
4. Run `make test-integration HOST=<hostname>` to validate the new profile.
All stubs must be explicit in the overlay — the real inventory is never edited.
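Before step 4, a quick check that both files from steps 1 and 2 exist can save a failed run. A minimal sketch, using the paths named above; `missing_profile_files` is a hypothetical helper, not part of the harness.

```python
from pathlib import Path


def missing_profile_files(repo_root: Path, hostname: str) -> list[Path]:
    """Return the profile/override files that are still missing for a host."""
    base = repo_root / "tests" / "integration"
    required = [
        base / "profiles" / f"{hostname}.json",
        base / "overrides" / f"{hostname}.yml",
    ]
    return [path for path in required if not path.is_file()]
```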
---
## Reproducing the 2026-06-17 incident
The acceptance test for the harness (ADR-025) deliberately reproduces the incident:
1. Run with today's `base` (firewall on, no `docker_host` container-forward drop-in):
```bash
make test-integration HOST=askari CERTS=internal
```
The assert step **must FAIL** after reboot (Docker forwarding dead, published ports
unreachable). If it passes, the harness is not faithful.
2. Implement the `docker_host` container-forward rules (FRICTION 2026-06-17 #1 fix) and
re-run. The assert step **must PASS** across the reboot.
This round-trip proves: (a) the harness faithfully reproduces the incident, and (b) the
fix survives a real reboot.
---
## Related
- ADR-025 — decision record for this harness (approach, cert tiers, safety invariants)
- ADR-008 — testing methodology; this is Level 2/3
- `docs/security/accepted-risks.md` R6 — `le-prod-wildcard` accepted risk
- `docs/FRICTION.md` — 2026-06-17 signals that motivated this runbook


@ -0,0 +1,144 @@
# Runbook — Enrolling a NetBird client (road-warrior device)
Joins a **client/road-warrior device** (laptop, desktop, phone) to the boma NetBird mesh
so it can reach `ubongo` and other peers from anywhere. The self-hosted coordinator is on
`askari` (ADR-016, M4b); enrollment lands a device on the `100.64.0.0/10` overlay.
> **Hosts vs clients.** Managed **Linux hosts** join via the `base` role's `mesh` concern
> (`base__mesh_enabled: true` + the reusable key in `vault.netbird.setup_key`) — see
> ADR-016 / the `base` README, *not* this runbook. This runbook is for **user devices**
> that are not managed with Ansible.
verified: NetBird client install + self-hosted `--management-url` flow · docs.netbird.io
(`/get-started/install/windows`, `/get-started/cli`) · 2026-06-17
## Prerequisites
- The coordinator's first-boot `/setup` admin exists and you can log in at
`https://netbird.askari.wingu.me`.
- **Auth, pick one:**
- **SSO** (recommended for a personal device) — your dashboard account; no secret to copy.
- **Setup key** — dashboard → **Settings → Setup Keys** → a reusable key (mint a
client-specific one for clean ACL grouping, or reuse the existing reusable key).
- Local **admin rights** on the device (the client installs a service).
- **Coordinator facts:** management URL `https://netbird.askari.wingu.me`; `ubongo`
= `100.99.146.14` (`ubongo.netbird.selfhosted`); `askari` = `100.99.226.39`.
---
## Part A — Windows 11
1. **Install:** download + run the MSI **https://pkgs.netbird.io/windows/msi/x64**
(official x64 client; installs the tray app + the `netbird` service).
2. **Connect** from an **elevated** Windows Terminal / PowerShell ("Run as administrator"):
```powershell
netbird up --management-url https://netbird.askari.wingu.me
```
A browser opens; sign in with your dashboard account. (If SSO does not open a browser,
use a setup key instead: `netbird up --setup-key <KEY> --management-url https://netbird.askari.wingu.me`.)
3. Proceed to **Part C** (verify).
---
## Part B — Other platforms (same management URL)
- **macOS / Linux desktop:** install the client (macOS: NetBird app / Homebrew; Linux:
`pkgs.netbird.io` per the distro — same apt/rpm flow as `base`'s `mesh` concern), then
`netbird up --management-url https://netbird.askari.wingu.me` (Linux: prefix `sudo`).
- **Android / iOS:** install the **NetBird** app, then in **Settings → Advanced /
Server** set the management server to `https://netbird.askari.wingu.me` **before**
logging in; connect and complete the SSO login. (Setup keys are supported in-app too.)
---
## Part C — Verify + use
```sh
netbird status # expect: Management: Connected, Signal: Connected, a 100.x NetBird IP
netbird status -d # peer detail — ubongo (100.99.146.14) + askari (100.99.226.39) listed
```
Reach `ubongo` over the mesh:
```sh
ssh sjat@100.99.146.14 # or: ssh sjat@ubongo.netbird.selfhosted
```
**SSH auth is separate from the mesh:** `ubongo` is key-only (passwords disabled), so the
device needs an SSH key authorised for `sjat@ubongo`. The mesh provides the network path;
the SSH key provides auth.
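If you want to script the check, the status output can be parsed as simple `Key: Value` lines. This is a sketch that assumes the plain-text layout implied by the comments above (`Management: Connected`, `Signal: Connected`). The exact output may vary between client versions, and both function names are hypothetical.

```python
def parse_status(text: str) -> dict[str, str]:
    """Parse 'Key: Value' lines from `netbird status` text output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def mesh_connected(text: str) -> bool:
    """True when both Management and Signal report Connected."""
    fields = parse_status(text)
    return all(fields.get(name, "").startswith("Connected")
               for name in ("Management", "Signal"))
```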
---
## Troubleshooting — mesh drops / SSH to `ubongo` times out
Symptom: SSH to `ubongo` (or any peer) times out for minutes and recovers on its own;
`netbird status` shows **Management/Signal: Disconnected** or peers stuck **Connecting**.
verified: client DNS/relay behaviour + NRPT scope read from a 0.72.4 debug bundle;
mitigations per docs.netbird.io (`/manage/dns/troubleshooting`,
`/help/troubleshooting-client`) · 2026-06-18
**1. Triage — is it your device or the coordinator?** On the device:
```sh
netbird status -d # Management/Signal Connected? peers P2P/Relayed?
nslookup netbird.askari.wingu.me # coordinator FQDN
nslookup pkgs.netbird.io # a PUBLIC name — control test
```
If the relay/handshake errors say `lookup netbird.askari.wingu.me: no such host` **and**
a *public* name (`pkgs.netbird.io`) also fails to resolve, your **local resolver is
dead** — the coordinator and `ubongo` are almost certainly fine. NetBird only manages
`*.netbird.selfhosted` resolution (a single NRPT rule), so it is **not** the cause.
Confirm from the other side if you can: the dashboard shows peer *last-seen*; `askari`/
`ubongo` staying green ⇒ the fault is your device's network.
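The same control test can be scripted. The sketch below resolves the coordinator FQDN and the public control name and classifies the result. The third outcome (only the coordinator name fails) is not covered by this runbook and is included as an assumption. Note that a hosts-file pin also makes the coordinator lookup succeed.

```python
import socket
from typing import Callable

COORDINATOR = "netbird.askari.wingu.me"
CONTROL = "pkgs.netbird.io"  # public name used as the control test


def resolves(name: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """True if the system resolver returns at least one address for name."""
    try:
        resolver(name, 443)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def triage(resolver: Callable = socket.getaddrinfo) -> str:
    """Classify the fault as described in step 1."""
    if resolves(COORDINATOR, resolver):
        return "dns-ok"
    if not resolves(CONTROL, resolver):
        return "local-resolver-dead"
    return "coordinator-name-only"
```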
**Why it cascades:** NetBird re-resolves the coordinator FQDN on every reconnect. A
network transition (Wi-Fi ↔ phone hotspot, sleep/wake) that briefly kills DNS means it
can't reach management/signal/relay — and since `ubongo` is **relay-only** (below), there
is no direct path to fall back to, so SSH dies until DNS recovers.
**2. Make the device resilient:**
- **Reliable resolvers** — set the device's DNS to public resolvers (`1.1.1.1`, `8.8.8.8`)
rather than a network-handed or homelab-internal resolver that's unreachable off-LAN.
Windows: inspect with `Get-DnsClientServerAddress`.
- **Pin the coordinator** so a DNS hiccup can't strand the client — add to the hosts file
(`C:\Windows\System32\drivers\etc\hosts` as admin, or `/etc/hosts`):
```
77.42.120.136 netbird.askari.wingu.me
```
`askari`'s stable WAN IP; TLS still validates on the hostname. Removes the multi-minute
reconnect deadlocks.
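To confirm the pin is actually present, the hosts file can be checked with a few lines of code. This is a sketch only; `has_pin` is a hypothetical helper and does not modify the file.

```python
def has_pin(hosts_text: str, ip: str, fqdn: str) -> bool:
    """True if hosts_text maps fqdn to ip on an uncommented line."""
    for raw in hosts_text.splitlines():
        # Drop trailing comments, then split into address + hostnames.
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == ip and fqdn in fields[1:]:
            return True
    return False
```

For example, read `/etc/hosts` (or the Windows hosts path above) and call `has_pin(text, "77.42.120.136", "netbird.askari.wingu.me")`.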
**3. Break-glass — reach `ubongo` without the mesh.** When the mesh is down you still need
a way in. On the home LAN, go straight to `ubongo`'s wired address (bypasses the mesh and
coordinator DNS entirely):
```sh
ssh sjat@10.20.10.151 # ubongo eno1 (LAN) — verify this works from your device NOW
```
> ⚠️ This works **today** only because `ubongo`'s host-firewall default-deny is not yet
> applied. When the deferred mesh-hardening lands (SSH only on `wt0`), this path closes
> unless a break-glass SSH rule is added to the firewall catalog. That hardening **must**
> keep a non-mesh break-glass (catalog SSH rule from a trusted LAN/admin source) — else a
> DNS/mesh outage = full lockout. (ADR-021 break-glass.)
**Why `ubongo` is relay-only (and P2P is not the fix).** Peers connect to `ubongo` as
`Relayed`, never `P2P`: its `nftables` default-deny drops the inbound UDP that ICE
hole-punching needs (egress is open, so STUN itself succeeds). This is the **intended
current posture** — P2P / NAT-traversal is the *deferred mesh-hardening* (ADR-016/020,
STATUS.md). Enabling it needs a firewall-catalog UDP entry **plus** an `accepted-risks.md`
deviation or ADR amendment, and OPNsense NAT work — and it would **not** have prevented a
DNS-driven outage (a re-handshake still needs signal, which needs DNS). Tracked as future
hardening, not a quick fix.
---
## Notes
- **Split-tunnel:** NetBird routes only the `100.x` overlay by default — normal/work
networking is unaffected.
- **Persistence:** the service auto-starts on boot and reconnects; the tray app has
Connect/Disconnect; CLI `netbird down` / `netbird up` (no flags after first setup).
- **Troubleshooting:** *"failed while getting Management Service public key"* / won't
register: confirm `https://netbird.askari.wingu.me` loads in a browser from the device
(DNS + TLS + the gRPC routing through Caddy are reachable), the URL is exact, and the
terminal is elevated. For peers stuck Disconnected/Connecting or SSH-to-`ubongo`
timeouts that recover on their own, see **Troubleshooting — mesh drops** above.
- **Removing a device:** `netbird down` then uninstall; revoke its peer in the dashboard
(and the setup key if one-off).


@ -2,7 +2,8 @@
## Prerequisites
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not)
- Proxmox VM template exists (Debian 13 cloud-init image — see below if not).
Not needed for the control node `ubongo`, which is bare-metal (Part E).
- `rbw` is installed and unlocked (`rbw unlock`) so the vault password resolves from Vaultwarden
- The host's intended hostname and IP are decided
@ -57,9 +58,9 @@ locals {
}
```
Terraform clones the cloud-init template from Part A, sets the cloud-init values
(hostname, SSH key, IP/gateway), and writes the host's DNS A record. See ADR-009
for the full handoff and the `vms` output → inventory data contract.
Terraform clones the cloud-init template from Part A and sets the cloud-init values
(hostname, SSH key, IP/gateway). It writes no DNS records — the `dns` role owns the
internal zone. See ADR-009 for the full handoff and the `vms` output → inventory data contract.
---
@ -67,7 +68,7 @@ for the full handoff and the `vms` output → inventory data contract.
```bash
make tf-plan TF_ENV=production # review — confirm only the new VM is added
make tf-apply TF_ENV=production # create the VM + write its DNS A record
make tf-apply TF_ENV=production # create the VM (no DNS records written)
make tf-inventory TF_ENV=production # regenerate inventories/production/hosts.yml
```
@ -108,29 +109,47 @@ make check PLAYBOOK=site
# Should report no changes
```
> **Pre-flight before lockout-risky changes (firewall / sshd / boot):** before applying
> any change that touches nftables rules, SSH configuration, or boot ordering, run
> `make test-integration HOST=<name>` and confirm reboot-recovery on the local VM
> **while the break-glass (Proxmox console / Hetzner console) is still open**. Do not
> retire the break-glass until the integration test passes. See
> `docs/runbooks/integration-testing.md` and ADR-025.
---
## Part E — Control node (manual exception)
## Part E — Control node (`ubongo`, manual exception)
The control node runs Terraform and Ansible, so it cannot be created by the
Terraform it hosts (chicken-and-egg). It is the **one** host provisioned manually —
see ADR-009 and the control-node section of ADR-005. Use the template from Part A:
Terraform it hosts (chicken-and-egg). It is `ubongo`, a dedicated **physical**
machine outside the cluster — not a Proxmox guest. It is the **one** host
provisioned manually. Rationale, hardware target, and recovery model: ADR-015.
```bash
# Clone the template by hand (Proxmox UI or qm clone)
qm clone 9000 <VMID> --name <hostname> --full
qm set <VMID> --memory 2048 --cores 2 \
--ciuser ansible \
--sshkeys /path/to/ansible_ed25519.pub \
--ipconfig0 ip=<IP>/24,gw=<GATEWAY>
qm start <VMID>
```
> **Current state (STATUS.md):** `ubongo` is today managed as the operator account
> `sjat` (`group_vars/control` sets `ansible_user: sjat`); it has **no** dedicated
> `ansible` service user yet. The dedicated-`ansible`-user bootstrap (step 2) is a
> **pending** item. Steps below describe the intended end state.
Then set up the Ansible environment on it (`make setup`, `make collections`, set up
`rbw` and `rbw unlock`) per ADR-005, and add it to `inventories/<env>/hosts.yml` under the
`control` group. Because the control node is not in `local.vms`, this is the only
case where editing `hosts.yml` by hand is expected — every other host comes from
`make tf-inventory`.
1. Install Debian 13 on the physical box by hand (no template to clone).
2. Create the `ansible` user and install its SSH public key. *(Pending for `ubongo`,
currently managed as `sjat`; see the note above.)*
3. Set up the Ansible environment on it:
```bash
git clone <repo> ~/ansible
cd ~/ansible
make setup # venv + Python deps
make collections # Ansible collections
rbw login && rbw unlock # vault password from Vaultwarden (see rotate-secrets.md)
```
4. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016) — so it is
reachable over SSH from elsewhere.
5. Add `ubongo` to `inventories/<env>/hosts.yml` under the `control` group.
Because `ubongo` is not in `local.vms`, this is the only case where editing
`hosts.yml` by hand is expected. **Known limitation:** `make tf-inventory`
regenerates `hosts.yml` from Terraform outputs and will overwrite a hand-added
`control` entry — re-add `ubongo` after running it (preserving the control entry in
the generator is tracked separately, not yet built).
---


@ -82,7 +82,52 @@ service clears the security bar — record any conscious deviation in
manual in review today, with the planned `/security-review` aggregating every
`roles/*/SECURITY.md` to automate it.
### 10. Commit
### 10. Write the per-service verification spec (services)
For a **service** role, copy `docs/testing/service-verify-template.md` to
`roles/<rolename>/VERIFY.md` and fill it in: the critical user journeys that define
"working" for this service, what good looks like, what is not browser-verifiable
(→ manual handoff), and the test data needed. This is the per-service backbone for the
Level 4 `/verify-service` check (ADR-008 / ADR-017) and is part of the pre-production
service-clearance gate (`docs/security/service-checklist.md`).
### 11. Write the per-service operational-access record (services)
For a **service** role, copy `docs/access/service-access-template.md` to
`roles/<rolename>/ACCESS.md` and populate the role's `access__*` data
(`access__service`, `access__compose_project`/`_path`, `access__containers`,
`access__log.loki_labels`, and `access__api`: `enabled` + endpoint + `firewall_ref` +
`auth.vault_ref` + `health_path`, or `enabled: false` with a reason). `ACCESS.md` is
rendered from that data; the admin-API path must `firewall_ref` an entry in the
`group_vars` firewall catalog, never open a port itself (ADR-020/021). Once hosts exist,
`/check-access <rolename>` proves the documented paths are live — part of the
service-clearance gate (`docs/security/service-checklist.md`).
### 12. Write the per-service backup record (stateful services)
For a **stateful** service role, copy `docs/backup/service-backup-template.md` to
`roles/<rolename>/BACKUP.md` and populate the role's `backup__*` data (`backup__service`,
`backup__paths`, `backup__dumps` (`cmd` + `dest` per logical dump), and `backup__quiesce`;
ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md`
is rendered from that data. A **stateless** service sets `backup__state: false` with a
reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rolename>`
proves the declared state is captured — part of the service-clearance gate
(`docs/security/service-checklist.md`).
### 13. Pre-flight for lockout-risky roles
If the new role touches nftables rules, SSH configuration, or boot ordering, run a
local VM integration test and confirm reboot-recovery **before** deploying to a live
host and while the host's break-glass (Proxmox console / Hetzner console) is still
open:
```bash
make test-integration HOST=<target-host>
```
See `docs/runbooks/integration-testing.md` and ADR-025.
### 14. Commit
```bash
git checkout -b role/<rolename>


@ -30,6 +30,28 @@ clear "run: rbw unlock" error rather than a hang.
---
## Break-glass — vault access during a full cluster outage
The control node `ubongo` (ADR-015) is the tool used to rebuild the cluster, so it
must be able to decrypt the vault even when Vaultwarden (if hosted on the cluster)
is down. `rbw` keeps a **local encrypted copy** of the Vaultwarden vault and decrypts
it **offline** with your Vaultwarden master password — no live server needed for
entries it has already synced. The recovery design therefore requires:
- `rbw` on `ubongo` (and on `mamba`, the break-glass laptop) has **synced at least
once** while Vaultwarden was reachable (`rbw sync`).
- Your **Vaultwarden master password** is kept **offline** — in a password manager on
`mamba` and on paper in a safe — independent of any cluster-hosted Vaultwarden.
There is always exactly one irreducible offline root secret; here it is the
Vaultwarden master password. Keep it recoverable without the cluster.
> **Verified (2026-06-11, ADR-014):** confirmed on `ubongo` with rbw 1.15.0 — with
> the Vaultwarden host unreachable, `rbw sync` fails but `rbw get boma-ansible-vault`
> still decrypts from the local cache. Re-verify after an `rbw` major-version bump.
---
## Rotating a single secret value
1. Ensure the agent is unlocked: `rbw unlock`


@ -15,8 +15,14 @@ revisit (trigger).
|---|---|---|---|
| R1 | **Active supply-chain scanning deferred** — baseline hygiene *is* required (tiered image pinning per ADR-011 — stateful `tag@digest`, stateless rolling — prefer official/verified images; gitleaks), but images and dependencies are not actively vulnerability-scanned (Trivy/Grype) or signature-verified | Scanning only pays off with the capacity to triage its output; the realistic threat is opportunistic, not a targeted supply-chain attack | A monitoring/triage stack is live; hosting high-value data/finances for others; a relevant upstream compromise |
| R2 | **SELinux not used** — no SELinux mandatory access control | AppArmor — Debian-native and enforced via the CIS baseline — already provides MAC; adding SELinux means two MAC systems, non-native to Debian, for no real gain | A service that ships and requires its own SELinux policy; threat model shifts toward targeted attackers |
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and STUN (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh (NetBird v0.72.4 embeds STUN in the combined server — no separate Coturn) | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat (a host attacker covering tracks survives even full-cluster compromise). True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
| R5 | **No disk encryption on `ubongo`** — the control node's SSD (SanDisk X600 256 GB, TCG-Opal-capable but Opal unused) is unencrypted at rest, so it holds recovery-critical secrets in plaintext: the Ansible Vault password's `rbw` local cache and (future) Terraform state. Physical theft of the box would expose them | `ubongo` is always-on in a physically controlled location; compensating controls are a **BIOS supervisor password** and **disabled external/USB + PXE boot** (an attacker cannot trivially boot another OS to read the disk), and the offline-recoverable design means the irreducible root secret (Vaultwarden master password) is never stored on the box anyway. Full-disk encryption was weighed against the always-on/unattended-reboot requirement (LUKS+TPM auto-unlock or passphrase) and deferred for simplicity at this trust level | `ubongo` is relocated to a less-trusted physical location; the box starts holding additional high-value secrets; or a reinstall onto LUKS (TPM-sealed) is undertaken |
| R6 | **`le-prod-wildcard` integration runs** — when `CERTS=le-prod-wildcard` is passed to `make test-integration`, the production Gandi PAT (`vault.gandi.pat`) is passed to an ephemeral local test VM via the var overlay, and transient `_acme-challenge` TXT records are written into the real `wingu.me` DNS zone to satisfy the Let's Encrypt DNS-01 challenge. A compromised or long-lived test VM could exfiltrate the PAT; the real zone is briefly (seconds) modified | Scope is **on-demand only**: `le-staging` is the default cert tier (`CERTS=internal` for incident repro); `le-prod-wildcard` is an explicit opt-in. Compensating controls: the VM is ephemeral and destroyed on success; it sits on an isolated libvirt NAT network (no LAN/mesh access); TXT records are auto-removed by Caddy immediately after validation; the PAT is not persisted inside the VM after the run. ADR-025 documents the cert-tier design and the three isolation invariants | The PAT is exfiltrated from a test VM; the `wingu.me` zone shows unexpected records; a `CERTS=le-prod-wildcard` run must be audited or the tier must be revoked |
| R7 | **`claude` AI-worker has `NOPASSWD:ALL` sudo on `ubongo`** — the automated AI-worker account can execute any command as root on the control node without a password prompt. A compromised or misbehaving agent session could make arbitrary root-level changes to ubongo | The account is **password-locked** (no interactive `claude` login; `NOPASSWD` sudo is the account's only escalation path, so there is no "su to claude + sudo" attack). `auditd` + Loki attribution (ADR-018) logs every `sudo` invocation with the originating user. The drop-in (`/etc/sudoers.d/claude-ai-worker`) is repo-managed via `base__ai_worker_user` — revocable in one commit + one deploy. Single-operator homelab; all changes in git; off-machine backups (ADR-022). Full rationale: ADR-015 amendment (2026-06-18) + ADR-021 §Sudo model. | The AI-worker executes a destructive action that cannot be rolled back via git; the account key is compromised; the threat model shifts toward targeted remote attackers |
| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access** — `askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`) — LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |
_Last reviewed: 2026-06-04. The prior gaps (full CIS hardening, SELinux/AppArmor,
_Last reviewed: 2026-06-20. The prior gaps (full CIS hardening, SELinux/AppArmor,
IDS) were re-challenged and **adopted rather than accepted**: CIS Debian L1+L2 + CIS
Docker, AppArmor (enforce), AIDE file-integrity, and Suricata network IDS are now
part of the security strategy (ADR-002). See STATUS.md / `docs/TODO.md` for build


@ -47,7 +47,17 @@ This checklist is the generic **bar**. Each service answers it in its own
## Operability (security-adjacent)
- [ ] Logs go somewhere reviewable (central aggregation when available)
- [ ] Backup/restore is covered if the service holds state
- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries
`backup__*` data, `roles/<service>/BACKUP.md` is rendered, and `/check-backup`
reports the declared paths/dumps captured in the latest snapshot — or the service
sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
- [ ] Passed Level 4 service-UI verification (`/verify-service`) against staging — the
service has a populated `roles/<service>/VERIFY.md` and its critical journeys
verified (ADR-008 Level 4 / ADR-017)
- [ ] Operational access recorded and verifiable (ADR-021): the role carries `access__*`
data, `roles/<service>/ACCESS.md` is rendered, and `/check-access` reports the
documented paths green — or a deviation is recorded in
`docs/security/accepted-risks.md`
> Deviations are allowed but must be **conscious**: record them in
> `docs/security/accepted-risks.md`, don't leave them implicit.


@ -0,0 +1,484 @@
# Mesh VPN (NetBird) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Record the decision that boma's mesh VPN is NetBird (self-hosted on `askari`), by authoring ADR-016 and reconciling every doc that currently assumes OPNsense WireGuard or an undecided VPN.
**Architecture:** Documentation-only change. NetBird replaces ADR-007's VLAN-99 OPNsense WireGuard as the single remote-access overlay for `ubongo`, `askari`, and road-warrior clients; coordinator self-hosted off-site on `askari`; agent-per-host enrollment via the (unbuilt) `base` role; embedded local-user identity. The role/service implementation waits on the `base` role and service-role machinery that STATUS.md lists as not-yet-built — this plan settles the decision and the doc reconciliation only.
**Tech Stack:** Markdown only. Verification is the repo's pre-commit hooks (trailing-whitespace, end-of-file, gitleaks, ansible-lint, vault-encryption guard) plus a final cross-reference/staleness sweep. No markdown linter exists, so "tests" are hook-pass + grep checks.
---
## Pre-flight (read once before starting)
- **`rbw` must be unlocked before every commit** (the pre-commit ansible-lint hook decrypts `vault.yml`). Run `rbw unlocked` (exit 0 = good); if not, stop and ask the user to `rbw unlock`.
- **Commit style:** one commit per task, imperative subject ≤72 chars.
- **Order matters:** Task 1 (ADR-016) lands first — every later task links to it.
- **Spec reference:** `docs/superpowers/specs/2026-06-05-mesh-vpn-netbird-design.md`.
- **Branch:** start by creating `chore/mesh-vpn-netbird-docs` off `main` (the controller does this before dispatching Task 1; do not implement on `main`).
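The two hard stops above (never work on `main`, never commit with a locked `rbw`) could be wrapped in one guard. This is only a sketch: `preflight` is a hypothetical helper that does not exist in the repo, and it assumes a POSIX-style shell where an unquoted variable is split into words.

```bash
# Hypothetical helper, not part of the repo.
# $1 = current branch name
# $2 = command that succeeds when the vault agent is unlocked (default: rbw unlocked)
preflight() {
  branch="$1"
  unlock_check="${2:-rbw unlocked}"
  if [ "$branch" = "main" ]; then
    echo "refusing: on main"
    return 1
  fi
  # Left unquoted on purpose so "rbw unlocked" splits into command + argument.
  if ! $unlock_check >/dev/null 2>&1; then
    echo "refusing: rbw is locked"
    return 1
  fi
  echo "ok"
}

# Typical call before each commit (assumes you are inside the git work tree):
#   preflight "$(git rev-parse --abbrev-ref HEAD)" || exit 1
```

Passing the unlock check as a parameter keeps the guard testable on a machine without `rbw`.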
---
## File map
| File | Action | Responsibility after change |
|---|---|---|
| `docs/decisions/016-mesh-vpn.md` | Create | Home of record for the NetBird mesh decision |
| `docs/decisions/007-network.md` | Modify | VLAN-99 WireGuard retired; askari rides the mesh + hosts the coordinator |
| `docs/decisions/015-control-host.md` | Modify | Resolve deferred item #1 (mesh = NetBird on askari) |
| `docs/security/accepted-risks.md` | Modify | Replace R3 placeholder with the concrete residual risk |
| `docs/CAPABILITIES.md` | Modify | VPN row decided: NetBird, self-hosted |
| `STATUS.md` | Modify | Two rows: NetBird coordinator + agent enrollment (designed, not built) |
| `CLAUDE.md` | Modify | ADR-016 in Further reading |
---
### Task 1: Author ADR-016 (the home of record)
**Files:**
- Create: `docs/decisions/016-mesh-vpn.md`
- [ ] **Step 1: Create the ADR file**
Create `docs/decisions/016-mesh-vpn.md` with exactly this content (preserve em-dashes —, backticks, table pipes, and the `verified:` stamps):
```markdown
# ADR-016 — Mesh VPN (NetBird, self-hosted on `askari`)
## Context
`ubongo` (ADR-015) needs remote SSH access from anywhere without exposing anything to
the public internet; ADR-015 deferred the mechanism. ADR-007 already commits to
WireGuard-via-OPNsense for the `vpn` VLAN (VLAN 99, `10.99.0.0/24`: `askari` + road
warriors), and `docs/CAPABILITIES.md` flagged NetBird (mesh) as a real alternative to
weigh. This ADR settles it.
## Decision
A single **NetBird** mesh is the sole remote-access overlay, self-hosted on `askari`,
**replacing** ADR-007's VLAN-99 OPNsense WireGuard.
The decision in five parts:
1. **Scope — mesh replaces WireGuard.** One overlay for `ubongo`, `askari`, and
road-warrior clients. ADR-007's VLAN-99 WireGuard design is retired.
2. **Control plane — self-hosted on `askari`.** Sovereignty (boma self-hosts
Vaultwarden, Forgejo, DNS), no third-party trust, and an off-site coordinator that
survives a homelab outage and stays out of the cluster it administers.
3. **Tool — NetBird.** Self-hosting selects NetBird (first-class, fully open-source
self-host). Tailscale would mean Headscale (third-party reimplementation, partial
parity) — ruled out below.
4. **Routing — agent on every Linux host**, not a subnet router. At boma's scale (25
hosts) the "agent everywhere" cost is trivial and the `base` role already runs
everywhere, so enrollment is one uniform task. Avoids a routing SPOF and gives
granular per-peer ACLs. OPNsense (FreeBSD) is the one non-agent exception
(`mgmt`/gateway reached by a single advertised route or LAN-side admin).
5. **Identity — embedded local users** (Dex in the management container); external SSO
(Zitadel/Keycloak) stays an optional future.
## Verified facts (ADR-014)
verified: NetBird self-hosting · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
— components management+signal+dashboard+relay/TURN(Coturn), **single container since
v0.65**; **built-in local users / embedded IdP since v0.62** (external OIDC optional);
ports TCP 80/443 + UDP 3478 behind a reverse proxy; lightweight Linux + Docker Compose host.
verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05 — AGPLv3 for
`management/`/`signal/`/`relay/`, BSD-3-Clause elsewhere; fully open source, no
open-core feature gating.
## Architecture
Data plane: peer-to-peer WireGuard. Control plane: NetBird, self-hosted on `askari`.
NetBird manages its own overlay addressing (default `100.64.0.0/10`); no boma VLAN is
allocated for it.
- `askari` (Hetzner, off-site, always-up) — runs the NetBird stack **and** is a peer.
- `ubongo` — agent.
- All Linux managed hosts — agent via the `base` role.
- Road-warrior clients (`mamba`, phone, work PC) — agent/app.
- OPNsense / `mgmt` — single non-agent exception.
## Security
- **ACLs mirror ADR-007 intent** (NetBird default-deny): mesh peers → `srv` metrics
ports only; admin peers (`ubongo`, `mamba`) → `srv` + `mgmt`; clients → least
privilege.
- **Enrollment via setup keys** stored in `vault.yml` (`vault.netbird.setup_key`),
consumed by `base`; prefer ephemeral/scoped keys.
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
- **New public surface on `askari`:** management API + dashboard (80/443) + Coturn
(3478). Mitigated by TLS + embedded-IdP login, source-IP limits where practical,
`base` hardening, and version-pinned NetBird (ADR-011) patched on boma's cadence.
Recorded as accepted-risk R3.
## Recovery & operations
- **Ansible stays off the mesh:** `ubongo` reaches the fleet by LAN IP (ADR-009); a
mesh/coordinator outage never blocks on-LAN runs.
- **Bootstrap order:** stand up the coordinator on `askari` → enroll `ubongo` →
`base` enrolls the fleet.
- **Coordinator survival:** off-site on `askari` ⇒ mesh survives a homelab outage.
NetBird's management datastore is backed up encrypted off `askari` (synced to
`ubongo`/`mamba`); peers keep last-known config through a brief coordinator outage.
- **`askari` is Ansible-managed:** its own inventory group, `base` role, plus a
dedicated `netbird_coordinator` service role (one service = one role, ADR-004; with
`SECURITY.md`). Agent install/enrollment lives in `base`. NetBird server + agents are
version-pinned (ADR-011). boma's `dns` role stays authoritative for
`boma.baobab.band`; NetBird built-in DNS scoped/off.
## Status
Designed, not built — depends on the unbuilt `base` role and service-role machinery
(STATUS.md). This ADR records the decision and doc reconciliation; role tasks land when
`base` exists.
## What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config; the operator wants policy-based mesh access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Third-party trust for the control plane; against boma's self-hosting ethos. Its recovery benefit is matched by a self-hosted coordinator off-site on `askari`. |
| Tailscale + Headscale | Headscale is a third-party reimplementation with partial parity and no vendor support — weaker than NetBird's first-class self-hosting. |
| Coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes and dies with the homelab. `askari` instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` goes blind to `srv` when `ubongo` is down. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for one operator; embedded local users suffice. |
See also: ADR-007 (network — amended), ADR-015 (control host), ADR-002 (security),
ADR-011 (version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible
handoff), ADR-013 (heritage — V4 ran WireGuard; NetBird is translated, not transplanted).
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/016-mesh-vpn.md`
Expected: Passed/Skipped (ansible-lint Skipped for non-YAML).
```bash
git add docs/decisions/016-mesh-vpn.md
git commit -m "Add ADR-016 (mesh VPN — NetBird self-hosted on askari)"
```
---
### Task 2: Amend ADR-007 (retire VLAN-99 WireGuard, askari on the mesh)
**Files:**
- Modify: `docs/decisions/007-network.md`
Read the file first, then make FOUR exact edits. Preserve em-dashes —, backticks, table pipes.
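Because every edit below depends on the Find text matching verbatim, it may help to confirm that the text occurs exactly once before replacing it. A minimal sketch, assuming single-line Find text (`grep -F` treats a multi-line pattern as several alternatives); `exact_count` is a hypothetical helper name, not something the repo provides.

```bash
# Hypothetical helper: count the lines of a file that contain a literal string.
# $1 = literal text to look for (single line), $2 = file to search.
exact_count() {
  grep -cF -- "$1" "$2"
}

# Example call (the search text here is only an illustration):
#   exact_count 'WireGuard peers.' docs/decisions/007-network.md
```

A count of 1 means the edit is unambiguous. A count of 0, or more than 1, is a cue to re-read the file before editing.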
- [ ] **Step 1: Update the VLAN-99 row in the VLAN design table**
Find:
```
| 99 | `vpn` | `10.99.0.0/24` | WireGuard peers. `askari` (Hetzner) + road-warrior clients. |
```
Replace with:
```
| 99 | `vpn` | _(retired)_ | **Replaced by the NetBird mesh (ADR-016).** Remote access for `ubongo`, `askari`, and road-warrior clients rides a self-hosted NetBird overlay, not an OPNsense WireGuard subnet. `10.99.0.0/24` is freed. |
```
- [ ] **Step 2: Replace the VLAN-99 addressing subsection**
Find:
```
### VLAN 99 — vpn (10.99.0.0/24) — WireGuard
| Address | Host |
|---|---|
| `10.99.0.1` | OPNsense (WireGuard endpoint) |
| `10.99.0.2` | `askari` (Hetzner VPS) |
| `10.99.0.10`+ | Road-warrior clients |
```
Replace with:
```
### VLAN 99 — vpn — retired
The OPNsense WireGuard VPN (`10.99.0.0/24`) is **replaced by the NetBird mesh**
(ADR-016). Remote access for `ubongo`, `askari`, and road-warrior clients rides a
self-hosted NetBird overlay — data plane peer-to-peer WireGuard, control plane
NetBird self-hosted on `askari`. NetBird manages its own overlay addressing
(default `100.64.0.0/10`); no boma VLAN/subnet is allocated for it, and
`10.99.0.0/24` is freed.
```
- [ ] **Step 3: Update the two `vpn` rows in the OPNsense firewall-rules table**
Find:
```
| `vpn` | `srv` (metrics ports) | allow (monitoring) |
| `vpn` | `mgmt` | allow (administration from askari) |
```
Replace with:
```
| mesh peers | `srv` (metrics ports) | allow (monitoring) — enforced by NetBird ACLs, not OPNsense (ADR-016) |
| mesh peers | `mgmt` | allow (administration) — enforced by NetBird ACLs (ADR-016) |
```
- [ ] **Step 4: Rewrite the "External monitoring — askari" section**
Find:
```
`askari` (Hetzner VPS) connects via WireGuard to OPNsense (`10.99.0.1`).
Its peer address is `10.99.0.2`. OPNsense routes `10.99.0.0/24` into the VPN
tunnel and allows `askari` narrow access to `srv` metrics endpoints and `mgmt`
for administration.
`askari` is provisioned and managed independently of the Proxmox cluster — it
must be reachable even when the homelab is down (its entire purpose).
FQDN: `askari.baobab.band`.
```
Replace with:
```
`askari` (Hetzner VPS) is a peer on the **NetBird mesh** (ADR-016) and also **hosts
the self-hosted NetBird coordinator** (management/signal/relay). It reaches `srv`
metrics endpoints and `mgmt` for administration over the mesh, scoped by NetBird
ACLs — no OPNsense WireGuard tunnel and no `10.99.0.0/24` routing.
`askari` is provisioned and managed independently of the Proxmox cluster — it must
be reachable even when the homelab is down (its entire purpose), which is also why
the mesh coordinator lives here: an off-site control plane survives a homelab outage.
FQDN: `askari.baobab.band`.
```
- [ ] **Step 5: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/007-network.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/007-network.md
git commit -m "ADR-007: retire VLAN-99 WireGuard for the NetBird mesh (ADR-016)"
```
---
### Task 3: Resolve ADR-015 deferred item #1
**Files:**
- Modify: `docs/decisions/015-control-host.md`
Read the file first, then make THREE exact edits.
- [ ] **Step 1: Update provisioning step 3**
Find:
```
3. Join the mesh VPN (choice deferred — see below).
```
Replace with:
```
3. Join the mesh VPN — NetBird, self-hosted on `askari` (ADR-016).
```
- [ ] **Step 2: Update the Access & security mesh line**
Find:
```
- Remote access is via the **mesh VPN** (choice deferred). SSH to `ubongo` over the
mesh; nothing is published to the public internet — this stays inside ADR-002.
```
Replace with:
```
- Remote access is via the **mesh VPN** — NetBird, self-hosted on `askari` (ADR-016).
SSH to `ubongo` over the mesh; nothing is published to the public internet — this
stays inside ADR-002.
```
- [ ] **Step 3: Resolve deferred item #1**
Find:
```
1. **Mesh VPN choice** — Tailscale vs NetBird, hosted vs self-hosted. Recovery
dimension: a hosted coordinator keeps the mesh up when the cluster is down; a
self-hosted coordinator must live off-cluster (on `ubongo`), never on the fleet,
or it recreates the chicken-and-egg.
```
Replace with:
```
1. **Mesh VPN choice — RESOLVED (ADR-016):** NetBird, self-hosted on `askari`
(off-site, so it survives a homelab outage and stays out of the cluster it
administers). Replaces ADR-007's OPNsense WireGuard.
```
- [ ] **Step 4: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/015-control-host.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/015-control-host.md
git commit -m "ADR-015: resolve mesh-VPN deferral — NetBird on askari (ADR-016)"
```
---
### Task 4: Replace accepted-risks R3 with the concrete residual risk
**Files:**
- Modify: `docs/security/accepted-risks.md`
Read the file first, then make ONE exact edit. (The row is long — match it whole.)
- [ ] **Step 1: Replace the R3 row**
Find:
```
| R3 | **Mesh-VPN coordinator dependency (pending VPN choice)** — remote SSH to the control node `ubongo` (ADR-015) rides a mesh VPN whose coordination plane may be a third party (e.g. hosted Tailscale/NetBird) | A hosted coordinator keeps the mesh up when the cluster is down, which *helps* recovery; nothing is exposed to the public internet (ADR-002 preserved). Provisional — finalised when the VPN is chosen (separate discussion) | The VPN choice is settled (replace this entry with the concrete decision); a self-hosted coordinator is adopted; the provider's trust/security posture changes |
```
Replace with:
```
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
```
- [ ] **Step 2: Bump the "Last reviewed" date**
Find:
```
_Last reviewed: 2026-06-05. The prior gaps
```
This already reads `2026-06-05` (today) from the previous work, so **no change is needed** — confirm it says `2026-06-05` and move on. (If it shows an earlier date, set it to `2026-06-05`.)
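The date confirmation can be done without opening the file. A sketch under the assumption that the line starts with `_Last reviewed: ` followed by an ISO date and a full stop, as in the Find block above; `last_reviewed` is a hypothetical helper name.

```bash
# Hypothetical helper: print the date from a "_Last reviewed: YYYY-MM-DD." line.
# $1 = path to the accepted-risks file.
last_reviewed() {
  sed -n 's/^_Last reviewed: \([0-9-]*\)\..*/\1/p' "$1"
}

# Example call:
#   [ "$(last_reviewed docs/security/accepted-risks.md)" = "2026-06-05" ] && echo "date ok"
```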
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
Expected: Passed/Skipped.
```bash
git add docs/security/accepted-risks.md
git commit -m "accepted-risks: R3 now the concrete NetBird coordinator risk"
```
---
### Task 5: Update the CAPABILITIES VPN row
**Files:**
- Modify: `docs/CAPABILITIES.md`
Read the file first, then make ONE exact edit.
- [ ] **Step 1: Replace the VPN / remote access row**
Find:
```
| VPN / remote access | Netbird · *or* OPNsense WireGuard | P | candidate | Secure remote access to `srv`/`mgmt` | ⚠️ ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real alternative to weigh |
```
Replace with:
```
| VPN / remote access | NetBird (self-hosted on `askari`) | P | core | Secure mesh remote access to `srv`/`mgmt` | **Decided (ADR-016):** NetBird mesh replaces ADR-007 OPNsense WireGuard |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md`
Expected: Passed/Skipped.
```bash
git add docs/CAPABILITIES.md
git commit -m "CAPABILITIES: VPN decided — NetBird self-hosted (ADR-016)"
```
---
### Task 6: Add NetBird rows to STATUS.md
**Files:**
- Modify: `STATUS.md`
Read the file first, then make ONE exact edit (add two rows after the `ubongo` row).
- [ ] **Step 1: Add the two rows**
Find:
```
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
```
Replace with that SAME line followed by the two new rows:
```
| `ubongo` — physical control / AI-worker host | ADR-015 | Replaces the cluster control VM with a dedicated always-on x86 box outside the cluster. Decision recorded; box not yet acquired/installed, not in inventory. |
| NetBird mesh — coordinator on `askari` | ADR-016 | Self-hosted NetBird control plane (management/signal/relay) on askari; replaces ADR-007 WireGuard. Decision recorded; not deployed (askari + service-role machinery not built). |
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files STATUS.md`
Expected: Passed/Skipped.
```bash
git add STATUS.md
git commit -m "STATUS: record NetBird mesh (coordinator + base enrollment)"
```
---
### Task 7: Link ADR-016 from CLAUDE.md
**Files:**
- Modify: `CLAUDE.md`
Read the file first, then make ONE exact edit.
- [ ] **Step 1: Add the Further reading row after Network topology**
Find:
```
| Network topology | `docs/decisions/007-network.md` |
```
Replace with that SAME line followed by the new row:
```
| Network topology | `docs/decisions/007-network.md` |
| Mesh VPN (NetBird, self-hosted) | `docs/decisions/016-mesh-vpn.md` |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
Expected: Passed/Skipped.
```bash
git add CLAUDE.md
git commit -m "CLAUDE.md: link ADR-016 (mesh VPN)"
```
---
### Task 8: Final consistency sweep
**Files:** none modified (verification only)
- [ ] **Step 1: Confirm no doc still treats OPNsense WireGuard / `10.99` as the active remote-access path, and no "pending/deferred VPN" language remains**
Run:
```bash
grep -rniE "choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the ONLY hits are in `007-network.md` and `016-mesh-vpn.md`, where they describe the **retirement** of `10.99.0.0/24` (e.g. "`10.99.0.0/24` is freed", "no `10.99.0.0/24` routing") — those are correct and expected. There must be **no** hit that still treats OPNsense WireGuard or `10.99.0.x` as the *live* remote-access path, and **no** `choice deferred` / `pending VPN choice` anywhere. Legitimate mentions of "WireGuard" as NetBird's *data plane* are fine and won't match this pattern (it only matches `WireGuard endpoint|peers|to OPNsense`). If a canonical doc still names the WireGuard VPN as live, fix it as in the relevant task above and amend that commit.
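To see what the sweep pattern does and does not catch, it can be tried on sample lines first. The sketch below reuses the pattern from Step 1 unchanged; `SWEEP_PATTERN` and `count_hits` are hypothetical names introduced only for illustration.

```bash
# Same extended regex as the sweep command in Step 1.
SWEEP_PATTERN='choice deferred|pending VPN choice|10\.99\.0|WireGuard (endpoint|peers|to OPNsense)'

# Hypothetical helper: count matching lines on stdin (case-insensitive).
count_hits() {
  grep -ciE "$SWEEP_PATTERN"
}

# NetBird's data plane is mentioned, but nothing stale: not a hit.
printf '%s\n' 'Data plane: peer-to-peer WireGuard.' | count_hits

# Names the old OPNsense endpoint and a 10.99.0.x address: a hit.
printf '%s\n' '| `10.99.0.1` | OPNsense (WireGuard endpoint) |' | count_hits
```

The pattern cannot tell a retirement note such as "`10.99.0.0/24` is freed" from a live reference, which is why the remaining hits in `007-network.md` and `016-mesh-vpn.md` still need to be read by eye.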
- [ ] **Step 2: Confirm ADR-016 exists and is cross-linked**
Run:
```bash
test -f docs/decisions/016-mesh-vpn.md && echo "ADR-016 present"
grep -rlE "ADR-016|016-mesh-vpn" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the file exists and the referencing docs (007, 015, accepted-risks, CAPABILITIES, STATUS, CLAUDE.md) appear.
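The `grep -rl` output lists the files that do link ADR-016, so a missing link has to be spotted by its absence. An inverted check may be easier to read. This is a sketch only: `missing_links` is a hypothetical helper, and the file list in the example call is taken from the Expected line above.

```bash
# Hypothetical helper: print every file given as an argument that does NOT mention ADR-016.
# Unreadable or missing files are reported too, since grep fails on them.
missing_links() {
  for f in "$@"; do
    grep -qE 'ADR-016|016-mesh-vpn' "$f" 2>/dev/null || echo "$f"
  done
}

# Example call; empty output means every expected doc is cross-linked:
#   missing_links docs/decisions/007-network.md docs/decisions/015-control-host.md \
#     docs/security/accepted-risks.md docs/CAPABILITIES.md STATUS.md CLAUDE.md
```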
- [ ] **Step 3: Full hook run**
Run: `rbw unlocked && pre-commit run --all-files`
Expected: all hooks Passed/Skipped. Fix anything that fails (most likely trailing whitespace / end-of-file) and amend the owning commit.
- [ ] **Step 4: Push (only if the user asks)**
Per CLAUDE.md, push to `origin` is the off-machine backup. If the user wants it pushed:
```bash
git push origin <branch-or-main-after-merge>
```
---
## Self-review notes (author)
- **Spec coverage:** decision/architecture/security/recovery → Task 1 (ADR-016); the spec's "Documentation & implementation changes" table → Tasks 2–7; deferrals (external SSO, OPNsense mesh specifics, role implementation) are recorded in ADR-016/STATUS, not implemented here (correct — they need the unbuilt `base`/service-role machinery). ✓
- **Not in scope (intentional):** the `netbird_coordinator` service role, the `base`-role agent task, vault `setup_key` material, and any live deployment — all wait on `base`/service-role machinery (STATUS-honest). ✓
- **No placeholders:** every edit shows exact find/replace text; the `_(retired)_` token in ADR-007 is deliberate table content. ✓
- **Name consistency:** ADR file is `016-mesh-vpn.md` everywhere; `vault.netbird.setup_key`, `netbird_coordinator`, and `wt0` are used identically across ADR-016 and the sweep. ✓

@ -0,0 +1,605 @@
# Service-UI Verification (Level 4) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Build the authorable-now parts of ADR-008 Level 4 — a Claude-driven exploratory service-UI verification harness — namely ADR-017, the `/verify-service` skill, the per-service `VERIFY.md` template/convention, and the doc reconciliations; the *live run* stays deferred on `ubongo`/Authentik/staging.
**Architecture:** Mostly documentation + two new authorable artifacts (the `/verify-service` Claude Code command and the `VERIFY.md` template). No application code, no Ansible roles (none of the prerequisite roles exist). The harness *mechanism* is the `playwright` Claude Code plugin driving Chromium on `ubongo`; this plan does not install or run it — it records the decision, the standards, and the orchestration logic.
**Tech Stack:** Markdown + a Claude Code command file. Verification is the repo's pre-commit hooks plus a final cross-reference/staleness sweep. No markdown linter exists, so "tests" are hook-pass + grep checks.
---
## Pre-flight (read once before starting)
- **`rbw` must be unlocked before every commit** (the pre-commit ansible-lint hook decrypts `vault.yml`). Run `rbw unlocked`; if it exits non-zero, stop and ask the user to `rbw unlock`.
- **Commit style:** one commit per task, imperative subject ≤72 chars.
- **Order matters:** Task 1 (ADR-017) lands first — later tasks link to it.
- **Spec reference:** `docs/superpowers/specs/2026-06-05-service-ui-verification-design.md`.
- **Branch:** the controller creates `chore/service-ui-verification-docs` off `main` before dispatching Task 1; do not implement on `main`.
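The commit-style rule (subject of at most 72 characters) is easy to check mechanically before committing. A minimal sketch: `subject_ok` is a hypothetical helper, and it assumes a UTF-8 locale so that `${#subject}` counts characters rather than bytes.

```bash
# Hypothetical helper: succeed when a commit subject is at most 72 characters long.
# $1 = the subject line.
subject_ok() {
  subject="$1"
  [ "${#subject}" -le 72 ]
}

# Example call:
#   subject_ok "Add ADR-017 (service-UI acceptance verification, Level 4)" || echo "subject too long"
```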
---
## File map
| File | Action | Responsibility |
|---|---|---|
| `docs/decisions/017-service-ui-verification.md` | Create | Home of record for Level 4 verification |
| `docs/decisions/008-testing.md` | Modify | Expand the Level 4 stub; link ADR-017 |
| `docs/testing/service-verify-template.md` | Create | The `VERIFY.md` template (parallels `service-security-template.md`) |
| `.claude/commands/verify-service.md` | Create | The `/verify-service <name>` orchestrating skill |
| `docs/security/service-checklist.md` | Modify | Add "passed Level 4" to the pre-deploy gate |
| `CLAUDE.md` | Modify | Role-convention bullet (`VERIFY.md`); Further-reading ADR-017 row |
| `.gitignore` | Modify | Ignore the screenshot working dir |
| `docs/testing/reviews/README.md` | Create | Explains the committed-report dir (also makes the dir exist in git) |
| `STATUS.md` | Modify | Row: Level 4 verification (skill/template authorable; running deferred) |
| `docs/TODO.md` | Modify | Mark 2.2 (browser) + 2.3 addressed by ADR-017 |
**Deferred (not in this plan):** scaffolding `VERIFY.md` into `make new-role` (do it when that scaffold is next touched — noted in ADR-017); the Authentik test-user provisioning automation; per-service `VERIFY.md` files (no service roles exist); installing/running the `playwright` plugin.
---
### Task 1: Author ADR-017 (the home of record)
**Files:**
- Create: `docs/decisions/017-service-ui-verification.md`
- [ ] **Step 1: Create the ADR file**
Create `docs/decisions/017-service-ui-verification.md` with exactly this content (preserve em-dashes —, backticks, table pipes):
```markdown
# ADR-017 — Service-UI acceptance verification (Level 4)
## Context
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
Level 4 stub. Nothing below Level 4 exercises a service's **application UI** — none
answer "does PhotoPrism actually let me log in, upload a photo, and see a thumbnail?"
(TODO 8.2). The operator's ask (TODO 2.2 headless browsing + TODO 2.3 test users +
manual-test instruction): Claude spins up a browser, *sees* the service UI, exercises
it, generates test users, and instructs the operator on manual tests. Today Claude sees
a browser only passively (`/screenshot` fetches operator-taken shots from `mamba`); this
is the active counterpart.
## Decision
A Claude-driven exploratory service-UI verification harness — **Level 4** — invoked as
`/verify-service <name>` on `ubongo`. Five settled forks:
1. **Claude-driven exploratory** — Claude navigates with judgment, not deterministic
scripts. A scripted regression suite is explicitly not built here.
2. **Interactive, Claude-in-the-loop** — exploratory judgment can't be a headless cron
gate; scheduled smoke is a determinism job for health checks / Uptime Kuma later.
3. **Staging, full exercise** — Claude creates test users and exercises features
(incl. destructive flows) against a *staging* deploy; the rebuildable sandbox
resolves safety.
4. **Test users in Authentik (central IdP), real SSO flow** — authenticates through
Traefik + Authentik as a real user would.
5. **Per-service `VERIFY.md` backbone + free exploration** — each service role ships an
acceptance spec of critical journeys; Claude executes it and explores beyond it.
## VERIFY.md standard
Every service role ships a populated `roles/<service>/VERIFY.md`, copied from
`docs/testing/service-verify-template.md` — parallel to `SECURITY.md` from
`service-security-template.md`. A new role convention. It lists the service's critical
user journeys (what "working" means), what good looks like, and what is not
browser-verifiable (→ manual handoff). It also joins the pre-production gate in
`docs/security/service-checklist.md`.
## Test-user standard (TODO 2.3)
Test identities live only in the **staging** Authentik (never production): a dedicated
`test` group / naming prefix; ephemeral per-run credentials (staging is rebuildable, so
nothing persisted, none in `vault.yml`); reuse-or-create; teardown via staging rebuild
or explicit `test`-group cleanup.
## Reporting & manual handoff
`/verify-service` writes `docs/testing/reviews/YYYY-MM-DD-<service>.md` (+ `latest.md`),
mirroring `/review-repo` and `/capacity-review`: pass/fail per `VERIFY.md` journey,
observations, the test-user/env used, a verdict, and a structured **manual-test
checklist** for anything Claude can't do (physical device, paid/external flow,
subjective judgment) — the "instruct me on tests" output. Screenshots are saved to a
git-ignored working dir on `ubongo` (PNG bloat + secret-leak risk); the report links
them.
## Safety
- **Staging-only guard** — the skill refuses to run against production (exploratory
clicking is destructive); ADR-002-aligned hard stop.
- **Confined blast radius** — test users only in the staging `test` group; the run
sticks to the target service.
- **No secrets leaked** — the git-ignored screenshot dir is the safety boundary;
avoid capturing credential screens.
## Status
Designed. **Authorable now:** this ADR, the ADR-008 Level 4 expansion, the `VERIFY.md`
template, the `/verify-service` skill, the convention/checklist/Further-reading edits,
`.gitignore`/dir, STATUS/TODO. **Running is deferred** on its dependencies.
## Dependencies
- `ubongo` (ADR-015) — runs the browser. Designed, not built.
- `playwright` Claude Code plugin — enabled when this lands (`claude-code-setup.md`).
- Authentik (CAPABILITIES §2, planned) — central IdP for test users + SSO.
- A staging deploy of the service (ADR-008 Level 2) — staging is currently empty stubs.
- `make new-role` scaffolding `VERIFY.md` — deferred to when that scaffold is next touched.
## What was ruled out
| Option | Reason |
|---|---|
| Scripted Playwright regression suite | Operator wants exploratory judgment; scripts add maintenance burden. Could be a later layer, not this. |
| Scheduled headless smoke gate | Needs determinism the exploratory nature excludes; belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. |
| Free-form, no per-service spec | Non-repeatable, can miss a critical flow; `VERIFY.md` gives a backbone. |
| Staging bypasses SSO / per-app users | Wouldn't exercise the real Traefik+Authentik path; central test users are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`. |
See also: ADR-008 (testing — expanded), ADR-015 (control host), ADR-002 (security),
ADR-004 (`VERIFY.md` parallels `SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/017-service-ui-verification.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/017-service-ui-verification.md
git commit -m "Add ADR-017 (service-UI acceptance verification, Level 4)"
```
---
### Task 2: Expand the ADR-008 Level 4 stub
**Files:**
- Modify: `docs/decisions/008-testing.md`
- [ ] **Step 1: Replace the Level 4 stub with the full definition**
Find this exact block:
```
### Level 4 — Service-UI acceptance (planned, not built)
Claude drives a headless browser from `ubongo` against a *deployed* service: loads
the rendered UI, creates test users, exercises features, and hands the operator a
manual test script for the rest. Catches application-level regressions that no lower
level sees. The harness (Playwright/headless-Chromium, screenshot-back-to-Claude) is
a **separate spec**; `ubongo` is sized for it (ADR-015). Status: designed, not built
(STATUS.md).
```
Replace with:
```
### Level 4 — Service-UI acceptance (Claude-driven exploratory)
A Claude-driven exploratory check of a service's **application UI**, run as
`/verify-service <name>` on `ubongo` (ADR-017). Claude drives Chromium via the
`playwright` plugin against a **staging** deploy, authenticates through the real
Traefik + Authentik SSO flow using a test user in the staging `test` group, then
executes the service's `roles/<service>/VERIFY.md` acceptance journeys *and*
free-explores — judging pass/fail, screenshotting key states. It writes a dated report
to `docs/testing/reviews/` and hands the operator a manual-test checklist for anything
it can't verify (hardware, paid/external flows, subjective judgment).
Catches application-level regressions no lower level sees ("does PhotoPrism actually
serve photos?"). Placement: after Level 2 (staging deploy), before production
promotion. Exploratory and interactive by design — *not* a deterministic CI/cron gate
(that role belongs to health checks / Uptime Kuma).
**Status:** the skill, the `VERIFY.md` template, and standards are authorable now;
running it is deferred on `ubongo` + the `playwright` plugin + Authentik + a staging
deploy (STATUS.md). Full design: ADR-017.
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/008-testing.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/008-testing.md
git commit -m "ADR-008: expand Level 4 into the verify-service harness (ADR-017)"
```
---
### Task 3: Create the `VERIFY.md` template
**Files:**
- Create: `docs/testing/service-verify-template.md`
- [ ] **Step 1: Create the template**
Create `docs/testing/service-verify-template.md` with exactly this content (preserve `&lt;`/`&gt;` HTML escapes, em-dashes, backticks):
```markdown
# Per-service verification record — template
Copy this file to `roles/<service>/VERIFY.md` and fill it in when building a service
role (ADR-008 Level 4 / ADR-017). It is the per-service **acceptance spec**: the
critical user journeys that define "working" for this service. `/verify-service <name>`
reads it, drives a browser through them against the staging deploy, and explores beyond
them.
Delete this preamble in the copy and start from the heading below.
---
# Verify — &lt;service&gt;
## Critical user journeys
The acceptance criteria — what "working" means for this service. Numbered; each is an
action and its expected result. Example shape (replace with this service's flows):
1. SSO login via Authentik succeeds and lands on the service's home/dashboard.
2. &lt;core action&gt; — e.g. "upload a test image" → &lt;expected&gt; — "a thumbnail renders".
3. &lt;core action&gt; → &lt;expected&gt;.
## What good looks like
Key states/screens Claude should confirm (and screenshot) — the visual/textual signals
that the journeys above actually succeeded.
- &lt;e.g. "the uploaded image appears in the library grid within ~10s"&gt;
## Not browser-verifiable
Items to route to the manual-test handoff — things a headless browser can't or
shouldn't judge.
- &lt;e.g. hardware passthrough, a paid/external integration, subjective media quality&gt;
## Test data
What the journeys need, provisioned in the **staging** Authentik `test` group
(ephemeral, torn down by staging rebuild).
- &lt;e.g. "one test user; no pre-seeded content"&gt;
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/testing/service-verify-template.md`
Expected: Passed/Skipped.
```bash
git add docs/testing/service-verify-template.md
git commit -m "Add VERIFY.md template for service-UI acceptance (ADR-017)"
```
---
### Task 4: Create the `/verify-service` skill
**Files:**
- Create: `.claude/commands/verify-service.md`
- [ ] **Step 1: Create the command file**
Create `.claude/commands/verify-service.md` with exactly this content (preserve em-dashes, backticks, code fences):
```markdown
Exploratory service-UI verification (ADR-008 Level 4 / ADR-017)
Drive a browser against a **staging** deploy of a service, exercise its
`roles/<service>/VERIFY.md` acceptance journeys plus free exploration, and write a
tracked report. Argument: the service/role name (e.g. `/verify-service photoprism`).
## Prerequisites (this is forward-looking — ADR-017 dependencies)
This skill cannot run until all of these exist; if any is missing, say so and stop —
do not improvise around it:
- `ubongo` with the `playwright` Claude Code plugin (browser automation tools).
- A **staging** deploy of the target service (ADR-008 Level 2).
- Authentik (staging) for test-user provisioning + SSO.
- `roles/<name>/VERIFY.md` present.
## Process
### Phase 0 — safety gate (staging only)
Confirm the target resolves to the **staging** environment/inventory, never production.
If you cannot prove it is staging, **stop** — exploratory clicking is destructive
(ADR-002). State why you stopped.
### Phase 1 — read intent
Read `roles/<name>/VERIFY.md`: the Critical user journeys, What good looks like, Not
browser-verifiable, and Test data sections.
### Phase 2 — test user
Provision (reuse-or-create) a test user in the staging Authentik `test` group, with
ephemeral credentials held only for this run. Never use a real/production account.
### Phase 3 — drive the browser
Via the `playwright` plugin, on `ubongo`: open the service's staging URL (resolved via
boma DNS), authenticate through the real Traefik + Authentik SSO flow, then execute each
`VERIFY.md` journey — judging pass/fail and screenshotting key states — and free-explore
for anything obviously broken. Save screenshots to the git-ignored `.verify-runs/`
working dir; avoid capturing credential screens.
### Phase 4 — write the report
Save to `docs/testing/reviews/YYYY-MM-DD-<name>.md` and overwrite
`docs/testing/reviews/latest.md`. Structure:
- **One-line verdict** — e.g. "5/5 journeys passed; one manual check pending".
- **Run metadata** — date, service, staging env, test user, reviewed commit SHA.
- **Per-journey result** — pass/fail against `VERIFY.md`, with the evidence (linked
screenshot path) and any observation.
- **Free-exploration findings** — anything noticed beyond the listed journeys.
- **Manual-test checklist** — the "Not browser-verifiable" items plus anything Claude
couldn't do: numbered steps, expected result, and why it was handed off.
### Phase 5 — clean up + commit
Offer to clean up the `test`-group user (or note that the staging rebuild will).
Commit the report markdown per CLAUDE.md git conventions. **Do not** commit
`.verify-runs/` (git-ignored).
## Notes
- Reports (markdown) are committed; screenshots stay local on `ubongo` in `.verify-runs/`.
- Exploratory and interactive — this is not a deterministic CI gate.
```
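The report naming in Phase 4 can be sketched as a small path helper. This is an illustration only: `report_paths` and its name check are hypothetical, and the skill itself is prose-driven and defines no such function.

```python
from datetime import date
from pathlib import PurePosixPath

# Directory named in Phase 4 of the skill.
REVIEWS_DIR = PurePosixPath("docs/testing/reviews")


def report_paths(service: str, run_date: date):
    """Return (dated report path, latest.md path) for one verification run."""
    # Hypothetical guard: keep the service name a single path component.
    if not service or "/" in service or service.startswith("."):
        raise ValueError(f"invalid service name: {service!r}")
    dated = REVIEWS_DIR / f"{run_date.isoformat()}-{service}.md"
    return dated, REVIEWS_DIR / "latest.md"


if __name__ == "__main__":
    dated, latest = report_paths("photoprism", date(2026, 6, 20))
    print(dated)
    print(latest)
```

The dated file keeps one report per run, while `latest.md` is overwritten each time, so only the dated path depends on the arguments.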
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files .claude/commands/verify-service.md`
Expected: Passed/Skipped.
```bash
git add .claude/commands/verify-service.md
git commit -m "Add /verify-service skill for Level 4 UI verification (ADR-017)"
```
---
### Task 5: Add Level 4 to the service-clearance gate
**Files:**
- Modify: `docs/security/service-checklist.md`
- [ ] **Step 1: Add an Operability bullet for Level 4**
Find this exact block:
```
## Operability (security-adjacent)
- [ ] Logs go somewhere reviewable (central aggregation when available)
- [ ] Backup/restore is covered if the service holds state
```
Replace with:
```
## Operability (security-adjacent)
- [ ] Logs go somewhere reviewable (central aggregation when available)
- [ ] Backup/restore is covered if the service holds state
- [ ] Passed Level 4 service-UI verification (`/verify-service`) against staging — the
service has a populated `roles/<service>/VERIFY.md` and its critical journeys
verified (ADR-008 Level 4 / ADR-017)
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/security/service-checklist.md`
Expected: Passed/Skipped.
```bash
git add docs/security/service-checklist.md
git commit -m "service-checklist: add Level 4 UI verification to the gate"
```
---
### Task 6: Update CLAUDE.md (role convention + Further reading)
**Files:**
- Modify: `CLAUDE.md`
- [ ] **Step 1: Add the `VERIFY.md` role-convention bullet**
Find this exact line:
```
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
```
Replace with that SAME line followed by a new bullet:
```
- Every **service** role must have a populated `SECURITY.md` (ADR-002/004) — copy `docs/security/service-security-template.md`
- Every **service** role must have a populated `VERIFY.md` (ADR-008/017) — copy `docs/testing/service-verify-template.md`
```
- [ ] **Step 2: Add the ADR-017 Further-reading row**
Find this exact line:
```
| Testing methodology | `docs/decisions/008-testing.md` |
```
Replace with that SAME line followed by a new row:
```
| Testing methodology | `docs/decisions/008-testing.md` |
| Service-UI verification (Level 4) | `docs/decisions/017-service-ui-verification.md` |
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
Expected: Passed/Skipped.
```bash
git add CLAUDE.md
git commit -m "CLAUDE.md: VERIFY.md role convention; link ADR-017"
```
---
### Task 7: Git-ignore screenshots + create the reviews dir
**Files:**
- Modify: `.gitignore`
- Create: `docs/testing/reviews/README.md`
- [ ] **Step 1: Add the screenshot working dir to `.gitignore`**
Find this exact block at the end of `.gitignore`:
```
# Terraform
terraform/**/.terraform/
terraform/**/*.tfstate
terraform/**/*.tfstate.backup
terraform/**/terraform.tfvars
# .terraform.lock.hcl is intentionally tracked (pins provider versions)
```
Replace with:
```
# Terraform
terraform/**/.terraform/
terraform/**/*.tfstate
terraform/**/*.tfstate.backup
terraform/**/terraform.tfvars
# .terraform.lock.hcl is intentionally tracked (pins provider versions)
# Service-UI verification screenshots (kept locally on ubongo, not committed — ADR-017)
.verify-runs/
```
- [ ] **Step 2: Create the reviews dir README (so the dir exists in git)**
Create `docs/testing/reviews/README.md` with exactly this content:
```markdown
# Service-UI verification reports
Dated reports written by `/verify-service` (ADR-008 Level 4 / ADR-017), one per run:
`YYYY-MM-DD-<service>.md`, plus `latest.md`. These markdown reports are committed; the
screenshots they reference stay local on `ubongo` in the git-ignored `.verify-runs/`
working dir.
No reports yet — the harness is designed, not yet runnable (see STATUS.md).
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files .gitignore docs/testing/reviews/README.md`
Expected: Passed/Skipped.
```bash
git add .gitignore docs/testing/reviews/README.md
git commit -m "Git-ignore verify screenshots; add testing/reviews dir"
```
---
### Task 8: Add the Level 4 row to STATUS.md
**Files:**
- Modify: `STATUS.md`
- [ ] **Step 1: Add a row to the "Designed but not built" table**
Find this exact line:
```
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |
```
Replace with that SAME line followed by the new row:
```
| NetBird agent enrollment in `base` | ADR-016 | Every Linux host joins the mesh via the base role (setup keys in vault); SSH allowed only on `wt0`. Designed; base role not built. |
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | `/verify-service` skill + `VERIFY.md` template + standards are authorable and present; *running* deferred on ubongo + `playwright` plugin + Authentik + a staging deploy. |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files STATUS.md`
Expected: Passed/Skipped.
```bash
git add STATUS.md
git commit -m "STATUS: record Level 4 service-UI verification (ADR-017)"
```
---
### Task 9: Mark TODO 2.2/2.3 addressed
**Files:**
- Modify: `docs/TODO.md`
- [ ] **Step 1: Annotate the Testing items**
Find this exact block:
```
2. **Testing**
1. Choose and configure code-testing tooling (Molecule, etc.).
2. Decide how the AI interprets Molecule output and performs live testing:
API calls, curl pulls of web products, log reviews, and headless browsing.
3. Define a standard for generating test users and for instructing the user to
perform relevant manual tests.
```
Replace with:
```
2. **Testing**
1. Choose and configure code-testing tooling (Molecule, etc.).
2. Decide how the AI interprets Molecule output and performs live testing:
API calls, curl pulls of web products, log reviews, and headless browsing.
— Headless browsing DECIDED (ADR-017): the `/verify-service` Level 4 harness.
The API/curl/log-review siblings remain open.
3. ~~Define a standard for generating test users and for instructing the user to
perform relevant manual tests.~~ DECIDED (ADR-017): test users in the staging
Authentik `test` group; manual tests handed off as a checklist in the
`/verify-service` report.
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/TODO.md`
Expected: Passed/Skipped.
```bash
git add docs/TODO.md
git commit -m "TODO: mark headless-browsing + test-user standard decided (ADR-017)"
```
---
### Task 10: Final consistency sweep
**Files:** none modified (verification only)
- [ ] **Step 1: Confirm ADR-017 is present and cross-linked**
Run:
```bash
test -f docs/decisions/017-service-ui-verification.md && echo "ADR-017 present"
grep -rl "ADR-017\|017-service-ui-verification" docs/ CLAUDE.md STATUS.md .claude/ | grep -vE "superpowers/(plans|specs)/"
```
Expected: the file exists and the referencing files appear — ADR-008, CLAUDE.md, STATUS.md, the `VERIFY.md` template, the `/verify-service` skill, service-checklist, TODO, the reviews README.
- [ ] **Step 2: Confirm the new artifacts exist and the Level 4 stub is gone**
Run:
```bash
ls docs/testing/service-verify-template.md .claude/commands/verify-service.md docs/testing/reviews/README.md
grep -n "planned, not built" docs/decisions/008-testing.md || echo "Level 4 stub replaced (good)"
grep -n "\.verify-runs/" .gitignore && echo "screenshot dir ignored (good)"
```
Expected: all three files listed; the old Level 4 "planned, not built" stub line gone; `.verify-runs/` in `.gitignore`.
- [ ] **Step 3: Full hook run**
Run: `rbw unlocked && pre-commit run --all-files`
Expected: all hooks Passed/Skipped. Fix anything that fails (likely trailing whitespace / end-of-file) and amend the owning commit.
- [ ] **Step 4: Push (only if the user asks)**
```bash
git push origin <branch-or-main-after-merge>
```
---
## Self-review notes (author)
- **Spec coverage:** decision/forks/architecture → Task 1 (ADR-017) + Task 2 (ADR-008); `VERIFY.md` standard → Task 3 (template) + Task 6 (convention) + Task 5 (gate); skill/mechanism/reporting/safety → Task 4 (`/verify-service`); reporting dir + screenshot policy → Task 7; STATUS/TODO reconciliation → Tasks 8-9. ✓
- **Buildable-now vs deferred:** every task is authorable without `ubongo`/Authentik/staging; the skill carries an explicit Prerequisites gate so it cannot pretend to run. Deferred items (new-role scaffold, Authentik automation, per-service `VERIFY.md`, plugin install) are recorded in ADR-017/STATUS, not implemented. ✓
- **No placeholders:** every create/edit shows exact content; the `&lt;…&gt;` tokens in the template are deliberate (match `service-security-template.md`'s house style). ✓
- **Name consistency:** `/verify-service`, `roles/<service>/VERIFY.md`, `docs/testing/service-verify-template.md`, `docs/testing/reviews/`, `.verify-runs/`, and the `test` Authentik group are used identically across all tasks. ✓
```


@@ -0,0 +1,331 @@
# Firewall Strategy (ADR-020) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Land the firewall *strategy* as ADR-020 and fold it into the living docs — no firewall code is built here (the host-nftables and OPNsense-as-code builds are separate follow-up specs).
**Architecture:** This is a documentation-only change. It creates `docs/decisions/020-firewall.md` from the approved design spec, then updates CLAUDE.md (Further reading + the firewall guardrail), `docs/TODO.md` (mark 3.5 decided), and `docs/CAPABILITIES.md` (point the firewall note at ADR-020). There is no executable code, so verification is consistency greps + `make lint`.
**Tech Stack:** Markdown docs only. `make lint` (yamllint + ansible-lint + check-tags) must stay green; none of these tools lint Markdown content, but the run confirms nothing else broke.
---
## File structure
| File | Responsibility | Action |
|------|----------------|--------|
| `docs/decisions/020-firewall.md` | The firewall strategy ADR (two-layer model, shared catalog, deferred builds) | Create |
| `CLAUDE.md` | Add ADR-020 to *Further reading*; harden the firewall guardrail bullet to reference the catalog/ADR-020 | Modify |
| `docs/TODO.md` | Mark item 3.5 DECIDED (ADR-020) | Modify |
| `docs/CAPABILITIES.md` | Point the existing firewall parenthetical at ADR-020 + the two-layer model | Modify |
Notes for the implementer:
- The design spec this ADR is based on is `docs/superpowers/specs/2026-06-06-firewall-strategy-design.md` — read it if you need the full rationale, but the ADR text below is complete and self-contained.
- Existing ADRs live in `docs/decisions/` numbered 001-019; this is 020. Match their concise, decision-focused tone (ADR-019 is a good recent reference).
- Before any `git commit`, the pre-commit hook runs and decrypts `vault.yml`, so the vault agent must be unlocked: run `rbw unlocked` (exit 0 = good). If locked, ask the user to `rbw unlock` and wait. None of these tasks touch vault files.
- Run `make lint` via the repo venv wiring (the Makefile handles paths).
---
### Task 1: Write ADR-020
**Files:**
- Create: `docs/decisions/020-firewall.md`
- [ ] **Step 1: Create the ADR**
Create `docs/decisions/020-firewall.md` with exactly this content:
````markdown
# ADR-020 — Firewall strategy: two-layer model with a shared service catalog
## Status
Accepted (2026-06-06). Resolves TODO 3.5 ("Decide the firewall strategy — which
firewall, ruleset, per-host vs central").
**Strategy ADR.** It pins the architecture and each layer's responsibilities; the
detailed builds are separate follow-up efforts (see *Scope*).
## Context
boma needs a firewall strategy that is predictable, declarative, and defends the stated
threat model — opportunistic external, lateral movement / blast radius, operator/agent
error (ADR-002). The pieces were already committed across other ADRs (`nftables`
default-deny on hosts — ADR-002; OPNsense at the perimeter — ADR-007; Docker with
`iptables: false` — ADR-004), but nothing tied them together: which layer owns what,
where firewall intent is declared, and how the layers stay consistent. Without that,
ports drift open ad-hoc and "per-host vs central" stays unanswered.
## Decision
### Two layers, distinct jobs
**OPNsense — perimeter + inter-VLAN.** Owns the WAN edge and all policy *between zones*:
`lan`/`iot`/`guest` ↔ `srv`, `mgmt` access, and the per-VLAN egress rules (ADR-007). It
is **structurally blind to intra-`srv` traffic** — services share the switched `srv`
subnet (VLAN 20), which never reaches the gateway.
**Host nftables — host-local + east-west within `srv`** (in the `base` role, every VM):
- **Default-deny inbound**; allow loopback + established/related.
- **East-west allowlist**: a service host accepts a connection only from declared
sources (e.g. the reverse proxy, a named peer) — the lateral-movement control OPNsense
cannot provide.
- **Permissive egress**: allow outbound + established/related; per-VLAN egress
restriction stays at OPNsense (ADR-007). Host-level egress allowlisting is
high-friction (every DNS/NTP/update/registry/webhook must be enumerated) for limited
added benefit once the VLAN already bounds where a host can go.
- **Docker**: daemon runs with `"iptables": false`; nftables owns all filtering,
including container traffic (ADR-004).
- **Guaranteed management plane**: loopback, established/related, and `wt0` (NetBird,
ADR-016) for SSH + Ansible are always allowed, independent of the catalog, applied
atomically — a malformed or empty catalog can never lock out management. (ADR-016: SSH
is allowed only on `wt0`.)
So "per-host vs central" is answered: **both**, with clear ownership.
### Single source of truth — a shared service catalog
A central, declarative **service catalog** in `group_vars/` is the one source of truth
for firewall intent (aligning with ADR-002's "port definitions live in `group_vars/`",
and keeping connectivity *topology* in inventory rather than in any one self-contained
service role — ADR-004). Each entry describes a service's **ingress**:
```yaml
photoprism:
ingress:
- { from: reverse_proxy, port: 2342, proto: tcp }
reverse_proxy:
ingress:
- { from: lan, port: 443, proto: tcp }
```
`from` is **symbolic**, resolved at render time: a host/group → IP(s) from inventory; a
role (`reverse_proxy`) → the host(s) filling it; a VLAN/zone (`lan`) → the subnet from
the ADR-007 table. This keeps the catalog readable and resilient to IP changes.
### Each layer renders only its own slice
| Ingress rule | Host nftables | OPNsense |
|---|---|---|
| `from: reverse_proxy` (a `srv` peer) | allow proxy IP → port | — (intra-`srv`, invisible) |
| `from: lan` (cross-VLAN) | allow `lan` subnet → port | allow `lan` → host:port |
The dominant pattern falls out naturally: most services are **proxied** — their only
ingress is `from: reverse_proxy`, and users reach them through the reverse proxy, which
alone carries `from: lan, port: 443` (matches "services sit behind the reverse proxy
with authentication", ADR-002).
This was chosen over a single connectivity-model-generates-both (too much machinery,
tight coupling of two very different rule domains) and over fully independent per-layer
declarations (real drift risk).
### OPNsense automation — owned here, mechanism deferred
OPNsense is Ansible-managed (CLAUDE.md: "OPNsense is entirely Ansible; no Terraform
OPNsense provider"). It renders the cross-VLAN slice of the catalog plus the static
ADR-007 facts. The **how** — config-XML templating vs the OPNsense API vs a plugin — is
deferred to the OPNsense-as-code follow-up spec. Recorded as an explicit open
sub-decision.
## Guardrails
- **The catalog is authoritative.** If a port is not in the catalog, it does not exist —
hardening the existing rule "never open a firewall port ad-hoc on a host" (ADR-002).
- **The `firewall` tag** (ADR-019) marks firewall tasks; `--tags firewall` re-renders
rules.
- **Drift detection (aspiration).** A deterministic check — in the spirit of
`scripts/check-tags.py` — comparing each host's live `nft` ruleset / listening ports
against the catalog and flagging anything undeclared. Ties to TODO 8.5
(`/security-review`). Not necessarily built first.
## Consequences
- Lateral movement within `srv` is constrained — the gap OPNsense structurally can't
close.
- One declarative catalog → no ad-hoc ports and no cross-layer drift on shared facts
(ports, IPs, sources).
- Cost: the catalog + render-per-layer machinery must be built and maintained; east-west
allowlisting adds per-service ingress declarations (mitigated by proxied-by-default,
which keeps most entries to a single line).
## Scope
**Decided here:** the two-layer model and responsibilities; host nftables = default-deny
inbound + east-west allowlist + permissive egress + guaranteed management plane + Docker
`iptables:false`; the shared `group_vars` catalog as single source of truth with
symbolic sources; each layer renders its own slice; the no-ad-hoc-ports guardrail.
**Deferred to follow-up specs (each its own brainstorm → plan):**
1. **Host nftables implementation** in `base` — catalog schema, nftables template,
Docker `iptables:false` integration, fail-safe ordering, Molecule tests. The natural
next spec.
2. **OPNsense-as-code** — tooling mechanism + cross-VLAN rule rendering.
3. **Drift-detection check** — if/when built.
## Related
ADR-002 (security baseline: nftables default-deny, fail2ban, blast radius),
ADR-004 (Docker model: `iptables:false`), ADR-007 (network topology, VLANs, OPNsense,
per-VLAN egress), ADR-016 (NetBird mesh: SSH on `wt0` only), ADR-019 (`firewall` tag).
````
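The symbolic `from` resolution described in the ADR text above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the resolver the follow-up spec will build: the function names, the data shapes, the lookup order (zone, then role, then host), and the hostnames and host IPs are all hypothetical. Only the catalog entries and the `lan` subnet come from this changeset.

```python
# Hypothetical inputs. In the real design these come from group_vars and inventory.
ZONES = {"srv": "10.20.0.0/24", "lan": "10.30.0.0/24"}
ROLES = {"reverse_proxy": ["proxy01"]}                            # role -> hosts filling it
INVENTORY = {"proxy01": "10.20.0.10", "photos01": "10.20.0.21"}   # invented host IPs

CATALOG = {
    "photoprism": {"ingress": [{"from": "reverse_proxy", "port": 2342, "proto": "tcp"}]},
    "reverse_proxy": {"ingress": [{"from": "lan", "port": 443, "proto": "tcp"}]},
}


def resolve_source(name, zones, roles, inventory):
    """Resolve one symbolic `from` value to a list of IPs or CIDRs."""
    if name in zones:        # VLAN/zone -> its subnet
        return [zones[name]]
    if name in roles:        # role -> IPs of the host(s) filling it
        return sorted(inventory[host] for host in roles[name])
    if name in inventory:    # single host -> its IP
        return [inventory[name]]
    raise KeyError(f"unknown firewall source: {name!r}")


def expand_ingress(entry, zones, roles, inventory):
    """Flatten a catalog entry into one concrete rule per resolved source."""
    rules = []
    for rule in entry.get("ingress", []):
        for src in resolve_source(rule["from"], zones, roles, inventory):
            rules.append({"src": src, "port": rule["port"], "proto": rule["proto"]})
    return rules


if __name__ == "__main__":
    for service, entry in CATALOG.items():
        print(service, expand_ingress(entry, ZONES, ROLES, INVENTORY))
```

Raising on an unknown source, rather than silently skipping it, is one way to keep a typo in the catalog from turning into a missing allow rule; whether the real resolver does this is left to its own spec.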
- [ ] **Step 2: Verify the file is well-formed**
Run:
```bash
test -f docs/decisions/020-firewall.md && grep -c "^## " docs/decisions/020-firewall.md
```
Expected: exit 0 and a printed count of `7` (the H2 sections: Status, Context, Decision, Guardrails, Consequences, Scope, Related — H3 subsections under Decision are not counted by `^## `).
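One caveat on the check above: `grep -c "^## "` also counts matching lines inside fenced code blocks. The ADR's YAML block contains no such line, so the plain grep is sufficient here. For documents where that is not guaranteed, a fence-aware count could look like the sketch below; `count_h2` is a hypothetical helper, and its fence handling is simplified (it ignores indented fences).

```python
import re

# A fence line starts with three or more backticks or tildes.
FENCE = re.compile(r"^(`{3,}|~{3,})")


def count_h2(markdown_text: str) -> int:
    """Count `## ` headings that sit outside fenced code blocks."""
    count = 0
    open_fence = None  # marker that opened the current fenced block, if any
    for line in markdown_text.splitlines():
        match = FENCE.match(line)
        if match:
            marker = match.group(1)
            if open_fence is None:
                open_fence = marker
            elif (
                marker[0] == open_fence[0]
                and len(marker) >= len(open_fence)
                and line.strip() == marker  # a closing fence has no info string
            ):
                open_fence = None
            continue
        if open_fence is None and line.startswith("## "):
            count += 1
    return count


if __name__ == "__main__":
    sample = "# T\n## A\n### sub\n```yaml\n## not a heading\n```\n## B\n"
    print(count_h2(sample))
```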
- [ ] **Step 3: Commit**
```bash
git add docs/decisions/020-firewall.md
git commit -m "docs(adr): ADR-020 firewall strategy (two-layer + shared catalog)"
```
---
### Task 2: Wire ADR-020 into CLAUDE.md
**Files:**
- Modify: `CLAUDE.md` (Further reading table; firewall guardrail bullet)
- [ ] **Step 1: Add ADR-020 to the Further reading table**
In `CLAUDE.md`, find this row (around line 225):
```markdown
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
```
Add this row immediately after it:
```markdown
| Firewall strategy | `docs/decisions/020-firewall.md` |
```
(Exact column padding need not match perfectly — just produce a valid Markdown table row consistent with the surrounding rows.)
- [ ] **Step 2: Harden the firewall guardrail bullet**
In `CLAUDE.md`, find this bullet (around line 172, under "What Claude must not do without explicit instruction"):
```markdown
- Open a firewall port anywhere but the `group_vars` firewall definitions — never ad-hoc on a host (ADR-002)
```
Replace it with:
```markdown
- Open a firewall port anywhere but the `group_vars` service catalog — never ad-hoc on a host. If it's not in the catalog, it doesn't exist (ADR-002, ADR-020)
```
- [ ] **Step 3: Verify both edits**
Run:
```bash
grep -n "020-firewall" CLAUDE.md && grep -n "service catalog" CLAUDE.md
```
Expected: the Further reading row matches `020-firewall`, and the guardrail bullet now contains "service catalog".
- [ ] **Step 4: Commit**
```bash
git add CLAUDE.md
git commit -m "docs: link ADR-020; harden firewall guardrail to the service catalog"
```
---
### Task 3: Mark TODO 3.5 decided
**Files:**
- Modify: `docs/TODO.md` (item 3.5)
- [ ] **Step 1: Strike through and annotate item 3.5**
In `docs/TODO.md`, find this line (around line 26):
```markdown
5. Decide the firewall strategy (which firewall, ruleset, per-host vs central).
```
Replace it with:
```markdown
5. ~~Decide the firewall strategy (which firewall, ruleset, per-host vs central).~~
DECIDED (ADR-020): two layers — OPNsense (perimeter + inter-VLAN) + host nftables
(default-deny inbound + east-west allowlist, permissive egress). Single source of
truth: a `group_vars` service catalog with symbolic sources; each layer renders
its own slice. Builds deferred to follow-up specs (host nftables in `base`, then
OPNsense-as-code).
```
- [ ] **Step 2: Verify**
Run: `grep -n "DECIDED (ADR-020)" docs/TODO.md`
Expected: one match on the item 3.5 annotation.
- [ ] **Step 3: Commit**
```bash
git add docs/TODO.md
git commit -m "docs(todo): mark 3.5 firewall strategy decided (ADR-020)"
```
---
### Task 4: Update CAPABILITIES.md firewall note
**Files:**
- Modify: `docs/CAPABILITIES.md` (the firewall parenthetical in §1 Edge & networking, around line 32)
- [ ] **Step 1: Point the firewall note at ADR-020**
In `docs/CAPABILITIES.md`, find this line (around line 32, just under the §1 table):
```markdown
_(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not containers.)_
```
Replace it with:
```markdown
_(DHCP, firewall, mDNS reflection live on OPNsense — Ansible-managed, not containers.)_
_Firewalling is two-layer (ADR-020): OPNsense at the perimeter + inter-VLAN, plus
per-host `nftables` (default-deny inbound + east-west allowlist) rendered by the `base`
role from a shared `group_vars` service catalog. Both layers are still to be built._
```
- [ ] **Step 2: Verify and run the full lint suite**
Run:
```bash
grep -n "ADR-020" docs/CAPABILITIES.md && make lint
```
Expected: the new ADR-020 note is found, and `make lint` passes (yamllint clean, ansible-lint clean, `check-tags: OK`).
- [ ] **Step 3: Commit**
```bash
git add docs/CAPABILITIES.md
git commit -m "docs(capabilities): note two-layer firewall model (ADR-020)"
```
---
## Final verification
- [ ] Confirm cross-references resolve:
```bash
ls docs/decisions/020-firewall.md && grep -rl "ADR-020\|020-firewall" CLAUDE.md docs/TODO.md docs/CAPABILITIES.md
```
Expected: the ADR file exists and all three living docs reference it.
- [ ] `make lint` passes end to end.
- [ ] `git log --oneline -4` shows the four task commits.
- [ ] Sanity: the ADR's *Scope* section names the two deferred build specs (host nftables in `base`, OPNsense-as-code) so the next brainstorm has an obvious starting point.


@@ -0,0 +1,712 @@
# Host nftables Firewall (`base` firewall concern) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Build the `firewall`-tagged concern of the `base` role — default-deny nftables rendered from a shared `group_vars` service catalog, applied with an auto-rollback safety net.
**Architecture:** A pure Python filter plugin resolves the global `firewall_catalog`/`firewall_zones` into a flat per-host rule list; a Jinja template renders `/etc/nftables.conf` (validated at render time with `nft -c`); tasks apply it safely (snapshot → armed `systemd-run` revert → apply → confirm/disarm → persist). Molecule renders + syntax-checks only (never applies — it shares the host kernel); the resolver is unit-tested with pytest; real enforcement is a Level-2 staging concern.
**Tech Stack:** Ansible (`ansible.builtin` only — no new collections), nftables, Python 3 filter plugin + pytest, Molecule (Docker driver), systemd (`systemd-run` transient timer).
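The safe-apply order named in the Architecture line can be sketched as plain control flow. This is an illustration only, assuming hypothetical injected callables (`snapshot`, `arm_revert`, `apply`, `confirm`, `disarm`, `persist`); the role implements these steps as Ansible tasks around `nft` and `systemd-run`, and nothing below calls either.

```python
from typing import Callable, List


def safe_apply(
    snapshot: Callable[[], object],
    arm_revert: Callable[[], object],
    apply: Callable[[], object],
    confirm: Callable[[], object],
    disarm: Callable[[], object],
    persist: Callable[[], object],
) -> bool:
    """Run the apply sequence; return True only if the new rules were kept."""
    snapshot()         # 1. save the ruleset that is live right now
    arm_revert()       # 2. schedule a timed restore of that snapshot
    apply()            # 3. load the new ruleset
    if not confirm():  # 4. is the host still reachable?
        return False   #    no: leave the revert armed so it fires on its own
    disarm()           # 5. cancel the pending revert
    persist()          # 6. persist the confirmed ruleset
    return True


def make_step(log: List[str], name: str, result: object = None) -> Callable[[], object]:
    """Hypothetical stand-in for a real step: records its name, returns `result`."""
    def run() -> object:
        log.append(name)
        return result
    return run


if __name__ == "__main__":
    log: List[str] = []
    kept = safe_apply(
        make_step(log, "snapshot"), make_step(log, "arm_revert"),
        make_step(log, "apply"), make_step(log, "confirm", True),
        make_step(log, "disarm"), make_step(log, "persist"),
    )
    print(kept, log)
```

Arming the revert before the apply is the point of the ordering: if `apply` raises or connectivity is lost, the exception or the failed confirmation leaves the revert pending, so the old ruleset comes back without operator action.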
---
## File structure
| File | Responsibility | Action |
|------|----------------|--------|
| `roles/base/` (scaffold) | the base role skeleton | Create via `make new-role` |
| `roles/base/meta/main.yml` | role metadata (galaxy_info) | Fill |
| `roles/base/defaults/main.yml` | `base__firewall_*` behaviour knobs | Create |
| `inventories/{staging,production}/group_vars/all/firewall.yml` | shared `firewall_zones` + `firewall_catalog` | Create |
| `roles/base/filter_plugins/firewall_rules.py` | pure catalog→rules resolver | Create |
| `tests/test_firewall_rules.py` | pytest units for the resolver | Create |
| `roles/base/templates/nftables.conf.j2` | the ruleset | Create |
| `roles/base/tasks/main.yml` | include `firewall.yml` (tagged) | Replace scaffold |
| `roles/base/tasks/firewall.yml` | install + render + safe-apply | Create |
| `roles/base/molecule/default/molecule.yml` | fixture `ansible_host` | Adjust scaffold |
| `roles/base/molecule/default/converge.yml` | fixture catalog/zones + `apply:false` | Replace scaffold |
| `roles/base/molecule/default/verify.yml` | assert rendered rules + `nft -c` | Replace scaffold |
| `roles/base/README.md` | document the firewall concern | Fill |
| `STATUS.md`, `docs/CAPABILITIES.md` | reflect the build | Modify |
Notes for the implementer:
- Run Ansible/Python via the repo venv (`.venv/bin/...`); the Makefile wires paths. Molecule: `make test ROLE=base`.
- The Molecule platform pulls `forgejo.nyumbani.baobab.band/sjat/molecule-debian13:latest`. If the registry/image is unreachable in your environment, `make test` can't run — report DONE_WITH_CONCERNS for that step; the pytest units (Task 3) still fully validate the resolver logic, which is the only non-trivial code.
- Before any `git commit`, the pre-commit hook decrypts `vault.yml`, so the vault agent must be unlocked: run `rbw unlocked` (exit 0 = good); if locked, ask the user to `rbw unlock`. None of these tasks touch vault files.
- `make lint` must stay green (yamllint + ansible-lint over the new role + `check-tags`). Use FQCN, a tag on every task, string `mode:`, and `changed_when:` on every `command`/`shell`.
---
### Task 1: Scaffold the `base` role
**Files:**
- Create: `roles/base/` (via `make new-role`)
- Fill: `roles/base/meta/main.yml`, `roles/base/README.md`
- [ ] **Step 1: Scaffold**
Run: `make new-role NAME=base`
Expected: prints "Role base scaffolded at roles/base/". Creates `roles/base/{tasks,handlers,defaults,templates,files,meta,molecule/default}` and a scaffold `tasks/main.yml` (`---`), `molecule/default/{molecule.yml,converge.yml,verify.yml}`, `README.md`.
- [ ] **Step 2: Fill `roles/base/meta/main.yml`**
Replace the scaffold `---` with:
```yaml
---
galaxy_info:
author: sjat
description: Hardened baseline configuration for all boma hosts (Debian 13).
license: MIT
min_ansible_version: "2.17"
platforms:
- name: Debian
versions:
- trixie
dependencies: []
```
- [ ] **Step 3: Write `roles/base/README.md`**
Replace the scaffold content with:
```markdown
# base
Hardened baseline applied to every boma host. Built incrementally; the first concern
implemented is the **host firewall** (`firewall` tag).
## Firewall (nftables)
Default-deny inbound + east-west allowlisting + permissive egress, per ADR-020. Rules
are rendered from the shared `firewall_catalog` / `firewall_zones` (in `group_vars/all`)
by the `resolve_firewall_rules` filter, written to `/etc/nftables.conf`, syntax-checked
with `nft -c` at render time, and applied with an **auto-rollback safety net**
(`systemd-run` arms a revert that a follow-up task cancels once connectivity is
confirmed). The apply sequence lives in tasks rather than a handler so the confirm/cancel
step is controllable.
`/etc/nftables.d/*.nft` is `include`d by the ruleset — the extension hook the
`docker_host` role uses for container forward/NAT rules.
### Variables
See `defaults/main.yml` (`base__firewall_*`). SSH is accepted only on
`base__firewall_mgmt_interface` (default `wt0`, the NetBird overlay — ADR-016); set it to
a reachable interface/source until NetBird is built. Set `base__firewall_apply: false` to
render + validate without applying (used by Molecule).
### Testing
- `tests/test_firewall_rules.py` — pytest units for the resolver.
- `make test ROLE=base` — Molecule renders + `nft -c` syntax-checks (never applies; it
shares the host kernel). Enforcement + the apply/rollback path are verified at ADR-008
Level 2 on staging VMs.
```
- [ ] **Step 4: Verify scaffold + lint**
Run: `test -d roles/base/molecule/default && .venv/bin/ansible-lint roles/base`
Expected: directory exists; ansible-lint passes (the scaffold `tasks/main.yml` is empty `---`, meta is now filled).
- [ ] **Step 5: Commit**
```bash
git add roles/base
git commit -m "feat(base): scaffold role + meta/README (firewall concern incoming)"
```
---
### Task 2: Shared catalog/zones + role defaults
**Files:**
- Create: `inventories/staging/group_vars/all/firewall.yml`
- Create: `inventories/production/group_vars/all/firewall.yml`
- Create: `roles/base/defaults/main.yml`
- [ ] **Step 1: Create the shared firewall data (both envs)**
Write this identical content to **both** `inventories/staging/group_vars/all/firewall.yml`
**and** `inventories/production/group_vars/all/firewall.yml`:
```yaml
---
# Shared firewall topology — single source of truth for the host nftables layer
# (base role) and OPNsense (future). See docs/decisions/020-firewall.md.
# Zone → subnet (from ADR-007).
firewall_zones:
  mgmt: 10.10.0.0/24
  srv: 10.20.0.0/24
  lan: 10.30.0.0/24
  iot: 10.40.0.0/24
  guest: 10.50.0.0/24
# Service catalog: <name> → placement (host | group | hosts) + ingress[].
# Empty until services are built; hosts still get default-deny + the management plane.
firewall_catalog: {}
```
- [ ] **Step 2: Create `roles/base/defaults/main.yml`**
Replace the scaffold `---` with:
```yaml
---
# Host firewall (nftables) behaviour knobs. Shared topology (firewall_catalog/
# firewall_zones) lives in group_vars/all, not here. See docs/decisions/020-firewall.md.
base__firewall_mgmt_interface: wt0 # SSH accepted only on this iface (NetBird, ADR-016)
base__firewall_ssh_port: 22
base__firewall_rollback_timeout: 45 # seconds before the auto-revert fires on a bad apply
base__firewall_dropin_dir: /etc/nftables.d
base__firewall_apply: true # set false to render+validate without applying (CI/Molecule)
```
- [ ] **Step 3: Verify + lint**
Run: `.venv/bin/python -c "import yaml; [print(sorted(yaml.safe_load(open(p))['firewall_zones'])) for p in ['inventories/staging/group_vars/all/firewall.yml','inventories/production/group_vars/all/firewall.yml']]" && make lint`
Expected: prints the sorted zone list twice (`['guest', 'iot', 'lan', 'mgmt', 'srv']`); `make lint` passes.
- [ ] **Step 4: Commit**
```bash
git add inventories/staging/group_vars/all/firewall.yml inventories/production/group_vars/all/firewall.yml roles/base/defaults/main.yml
git commit -m "feat(base): shared firewall catalog/zones + firewall defaults"
```
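The zone map is the single source of truth for both firewall layers, so a mistyped or overlapping CIDR would propagate into every rendered ruleset. A minimal sketch of a sanity check follows, assuming the zone values are plain IPv4 CIDR strings as in the file above. The helper name `check_zones` is hypothetical and not part of the plan; it uses only the standard-library `ipaddress` module.

```python
import ipaddress
from itertools import combinations


def check_zones(zones):
    """Return a list of problems in a zone -> CIDR mapping (empty list means OK)."""
    problems = []
    networks = {}
    for name, cidr in zones.items():
        try:
            # strict=True rejects CIDRs with host bits set, e.g. 10.20.0.5/24.
            networks[name] = ipaddress.ip_network(cidr, strict=True)
        except ValueError as exc:
            problems.append(f"{name}: {exc}")
    # Compare every pair of valid networks once.
    for (a, net_a), (b, net_b) in combinations(sorted(networks.items()), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} ({net_a}) overlaps {b} ({net_b})")
    return problems


# The zone map from the shared firewall.yml above.
ZONES = {
    "mgmt": "10.10.0.0/24",
    "srv": "10.20.0.0/24",
    "lan": "10.30.0.0/24",
    "iot": "10.40.0.0/24",
    "guest": "10.50.0.0/24",
}

if __name__ == "__main__":
    print(check_zones(ZONES) or "zones OK")
```

A check like this could sit next to the resolver tests if the zone list grows; it is optional and nothing in the tasks below depends on it.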
---
### Task 3: The resolver filter plugin (TDD)
**Files:**
- Create: `roles/base/filter_plugins/firewall_rules.py`
- Test: `tests/test_firewall_rules.py`
- [ ] **Step 1: Write the failing tests**
Create `tests/test_firewall_rules.py`:
```python
import importlib.util
import pathlib

import pytest

# Load the filter plugin straight from the role so the tests need no Ansible install.
_PATH = (
    pathlib.Path(__file__).resolve().parent.parent
    / "roles" / "base" / "filter_plugins" / "firewall_rules.py"
)
_spec = importlib.util.spec_from_file_location("firewall_rules", _PATH)
fr = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(fr)

ZONES = {"lan": "10.30.0.0/24", "srv": "10.20.0.0/24"}
HOSTVARS = {
    "docker01": {"ansible_host": "10.20.0.50"},
    "docker02": {"ansible_host": "10.20.0.51"},
}
GROUPS = {"docker_hosts": ["docker01", "docker02"]}


def test_zone_source():
    cat = {"reverse_proxy": {"host": "docker01",
                             "ingress": [{"from": "lan", "port": 443, "proto": "tcp"}]}}
    out = fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS)
    assert out == [{"proto": "tcp", "port": 443, "sources": ["10.30.0.0/24"]}]


def test_service_source_resolves_to_host_ip():
    cat = {
        "reverse_proxy": {"host": "docker01", "ingress": []},
        "photoprism": {"host": "docker01",
                       "ingress": [{"from": "reverse_proxy", "port": 2342, "proto": "tcp"}]},
    }
    out = fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS)
    assert out == [{"proto": "tcp", "port": 2342, "sources": ["10.20.0.50/32"]}]


def test_group_placement_and_source_multi_host():
    cat = {"dns": {"group": "docker_hosts",
                   "ingress": [{"from": "docker_hosts", "port": 53, "proto": "udp"}]}}
    out = fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS)
    assert out == [{"proto": "udp", "port": 53,
                    "sources": ["10.20.0.50/32", "10.20.0.51/32"]}]


def test_host_with_no_services_returns_empty():
    cat = {"photoprism": {"host": "docker02",
                          "ingress": [{"from": "lan", "port": 2342, "proto": "tcp"}]}}
    assert fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS) == []


def test_unresolvable_from_raises():
    cat = {"x": {"host": "docker01",
                 "ingress": [{"from": "nope", "port": 80, "proto": "tcp"}]}}
    with pytest.raises(ValueError):
        fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS)


def test_duplicate_rules_deduped():
    cat = {"app": {"host": "docker01", "ingress": [
        {"from": "lan", "port": 8080, "proto": "tcp"},
        {"from": "lan", "port": 8080, "proto": "tcp"},
    ]}}
    out = fr.resolve_firewall_rules(cat, ZONES, "docker01", HOSTVARS, GROUPS)
    assert out == [{"proto": "tcp", "port": 8080, "sources": ["10.30.0.0/24"]}]


def test_missing_ansible_host_raises():
    cat = {"x": {"host": "docker01",
                 "ingress": [{"from": "docker02", "port": 80, "proto": "tcp"}]}}
    with pytest.raises(ValueError):
        fr.resolve_firewall_rules(cat, ZONES, "docker01", {"docker01": {}, "docker02": {}}, GROUPS)
```
- [ ] **Step 2: Run tests to verify they fail**
Run: `.venv/bin/python -m pytest tests/test_firewall_rules.py -v`
Expected: FAIL — `FileNotFoundError` / import error (the module doesn't exist yet).
- [ ] **Step 3: Write the filter plugin**
Create `roles/base/filter_plugins/firewall_rules.py`:
```python
"""Resolve the shared firewall catalog into concrete nftables ingress rules for one host.
Used by the base role's nftables template (ADR-020 / host-nftables design). Pure
functions — unit-tested in tests/test_firewall_rules.py.
"""


def _placement_hosts(entry, groups):
    """Hostnames a catalog entry is placed on (exactly one of host/group/hosts)."""
    if "host" in entry:
        return [entry["host"]]
    if "group" in entry:
        return list(groups.get(entry["group"], []))
    if "hosts" in entry:
        return list(entry["hosts"])
    raise ValueError(f"catalog entry has no placement (host/group/hosts): {entry!r}")


def _host_cidr(host, hostvars):
    """Return the host's ansible_host as a /32 CIDR, or raise if it is unset."""
    hv = hostvars.get(host) or {}
    ip = hv.get("ansible_host")
    if not ip:
        raise ValueError(f"no ansible_host for '{host}' — cannot resolve firewall source")
    return f"{ip}/32"


def _resolve_source(frm, catalog, zones, hostvars, groups):
    """Resolve a symbolic `from` to a sorted list of source CIDRs."""
    # Lookup order: zone name, catalog service, inventory group, single host.
    if frm in zones:
        return [zones[frm]]
    if frm in catalog:
        return sorted(_host_cidr(h, hostvars)
                      for h in _placement_hosts(catalog[frm], groups))
    if frm in groups:
        return sorted(_host_cidr(h, hostvars) for h in groups[frm])
    if frm in hostvars:
        return [_host_cidr(frm, hostvars)]
    raise ValueError(f"unresolvable firewall source '{frm}'")


def resolve_firewall_rules(catalog, zones, inventory_hostname, hostvars, groups):
    """Return sorted, de-duped [{proto, port, sources:[cidr,...]}] for services on this host."""
    catalog = catalog or {}
    zones = zones or {}
    groups = groups or {}
    rules = []
    for _name, entry in sorted(catalog.items()):
        # Skip services that are not placed on this host.
        if inventory_hostname not in _placement_hosts(entry, groups):
            continue
        for ing in entry.get("ingress", []):
            rules.append({
                "proto": ing.get("proto", "tcp"),
                "port": int(ing["port"]),
                "sources": _resolve_source(ing["from"], catalog, zones, hostvars, groups),
            })
    # Stable order plus de-duplication keeps the rendered file idempotent.
    seen = set()
    out = []
    for r in sorted(rules, key=lambda x: (x["port"], x["proto"], x["sources"])):
        key = (r["proto"], r["port"], tuple(r["sources"]))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


class FilterModule:
    """Ansible filter plugin entry point."""

    def filters(self):
        return {"resolve_firewall_rules": resolve_firewall_rules}
```
- [ ] **Step 4: Run tests to verify they pass**
Run: `.venv/bin/python -m pytest tests/test_firewall_rules.py -v`
Expected: PASS (all 7 tests).
- [ ] **Step 5: Commit**
```bash
git add roles/base/filter_plugins/firewall_rules.py tests/test_firewall_rules.py
git commit -m "feat(base): firewall catalog resolver filter plugin + tests"
```
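The `_placement_hosts` docstring says a catalog entry carries exactly one of `host`/`group`/`hosts`, but the function itself returns on the first key it finds, so an entry that sets two keys would resolve silently. A minimal sketch of a stricter pre-check follows, under the assumption that such entries should be rejected. `placement_errors` and `PLACEMENT_KEYS` are hypothetical names, not part of the plugin above.

```python
PLACEMENT_KEYS = ("host", "group", "hosts")


def placement_errors(catalog):
    """Return messages for catalog entries that do not set exactly one placement key."""
    errors = []
    for name, entry in sorted(catalog.items()):
        present = [key for key in PLACEMENT_KEYS if key in entry]
        if len(present) != 1:
            found = ", ".join(present) or "none"
            errors.append(
                f"{name}: expected exactly one of host/group/hosts, found {found}"
            )
    return errors


if __name__ == "__main__":
    catalog = {
        "dns": {"group": "docker_hosts", "ingress": []},
        "broken": {"host": "docker01", "group": "docker_hosts"},
    }
    for message in placement_errors(catalog):
        print(message)
```

Whether to enforce this in the resolver or only in a test is a design choice; the sketch keeps it separate so the plugin's behaviour is unchanged.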
---
### Task 4: Template + render tasks + Molecule fixtures
**Files:**
- Create: `roles/base/templates/nftables.conf.j2`
- Create: `roles/base/tasks/firewall.yml`
- Replace: `roles/base/tasks/main.yml`
- Adjust: `roles/base/molecule/default/molecule.yml`
- Replace: `roles/base/molecule/default/converge.yml`
- [ ] **Step 1: Create the template**
Create `roles/base/templates/nftables.conf.j2`:
```jinja
#!/usr/sbin/nft -f
# Ansible managed — do not edit by hand. Source: roles/base (ADR-020).
flush ruleset
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
iif "lo" accept
ct state established,related accept
ct state invalid drop
iif "{{ base__firewall_mgmt_interface }}" tcp dport {{ base__firewall_ssh_port }} accept
ip protocol icmp accept
ip6 nexthdr ipv6-icmp accept
{% for r in base__firewall_resolved %}
ip saddr { {{ r.sources | join(', ') }} } {{ r.proto }} dport {{ r.port }} accept
{% endfor %}
}
chain forward { type filter hook forward priority 0; policy drop; }
chain output { type filter hook output priority 0; policy accept; }
}
include "{{ base__firewall_dropin_dir }}/*.nft"
```
- [ ] **Step 2: Create `roles/base/tasks/firewall.yml`** (render path only; apply added in Task 5)
```yaml
---
- name: Install nftables
  ansible.builtin.apt:
    name: nftables
    state: present
  tags: [firewall]

- name: Ensure nftables drop-in dir exists
  ansible.builtin.file:
    path: "{{ base__firewall_dropin_dir }}"
    state: directory
    mode: "0755"
  tags: [firewall]

- name: Resolve firewall ingress rules for this host
  ansible.builtin.set_fact:
    base__firewall_resolved: >-
      {{ firewall_catalog | default({})
      | resolve_firewall_rules(firewall_zones | default({}),
      inventory_hostname, hostvars, groups) }}
  tags: [firewall]

- name: Render nftables ruleset (syntax-checked before install)
  ansible.builtin.template:
    src: nftables.conf.j2
    dest: /etc/nftables.conf
    mode: "0644"
    validate: "nft -c -f %s"
  register: base__firewall_render
  tags: [firewall]
```
- [ ] **Step 3: Replace `roles/base/tasks/main.yml`**
```yaml
---
- name: Configure host firewall (nftables)
  ansible.builtin.include_tasks: firewall.yml
  tags: [firewall]
```
- [ ] **Step 4: Add a fixture IP in `roles/base/molecule/default/molecule.yml`**
In the `provisioner.inventory.host_vars.instance` map (which already sets
`ansible_user: root`), add `ansible_host: 10.20.0.50` so the resolver can map the
instance to an IP. The block becomes:
```yaml
provisioner:
  name: ansible
  inventory:
    host_vars:
      instance:
        ansible_user: root
        ansible_host: 10.20.0.50
```
(The Molecule Docker connection addresses the container by name, not `ansible_host`, so
this is data-only and won't affect connectivity.)
- [ ] **Step 5: Replace `roles/base/molecule/default/converge.yml`** with a fixture catalog and `apply: false`
```yaml
---
- name: Converge
  hosts: all
  become: true
  gather_facts: true
  vars:
    base__firewall_apply: false
    firewall_zones:
      lan: 10.30.0.0/24
      srv: 10.20.0.0/24
      mgmt: 10.10.0.0/24
    firewall_catalog:
      reverse_proxy:
        host: instance
        ingress:
          - { from: lan, port: 443, proto: tcp }
      photoprism:
        host: instance
        ingress:
          - { from: reverse_proxy, port: 2342, proto: tcp }
  roles:
    - role: base
```
- [ ] **Step 6: Run Molecule (scaffold verify still trivially passes) + lint**
Run: `make lint && make test ROLE=base`
Expected: `make lint` passes. Molecule creates the container, converges (installs nftables, renders `/etc/nftables.conf`, and the `nft -c` `validate` succeeds), passes the idempotence run (second converge reports no changes), runs the scaffold `verify.yml` (asserts `true`), and destroys. If the registry image is unreachable, report DONE_WITH_CONCERNS and confirm `make lint` + Task 3 pytest still pass.
- [ ] **Step 7: Commit**
```bash
git add roles/base/templates/nftables.conf.j2 roles/base/tasks/firewall.yml roles/base/tasks/main.yml roles/base/molecule/default/molecule.yml roles/base/molecule/default/converge.yml
git commit -m "feat(base): render nftables ruleset from catalog (+ molecule fixture)"
```
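For reference, the per-rule line that the template's `for` loop emits can be reproduced outside Ansible. The sketch below is a plain-Python mirror of that one template line, assuming the resolver's output shape (`proto`, `port`, `sources`); the function name `render_ingress_lines` is hypothetical and exists only to make the expected text easy to eyeball or unit-test.

```python
def render_ingress_lines(rules):
    """Render resolved rules into the accept lines the template's for-loop produces."""
    lines = []
    for rule in rules:
        # Same shape as: ip saddr { <sources joined by ", "> } <proto> dport <port> accept
        sources = ", ".join(rule["sources"])
        lines.append(
            f"ip saddr {{ {sources} }} {rule['proto']} dport {rule['port']} accept"
        )
    return lines


if __name__ == "__main__":
    # The two rules the Molecule fixture catalog resolves to for `instance`.
    fixture = [
        {"proto": "tcp", "port": 443, "sources": ["10.30.0.0/24"]},
        {"proto": "tcp", "port": 2342, "sources": ["10.20.0.50/32"]},
    ]
    print("\n".join(render_ingress_lines(fixture)))
```

The rendered file itself remains the authority; this only shows which strings the later verify assertions are looking for.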
---
### Task 5: Safe apply with auto-rollback
**Files:**
- Modify: `roles/base/tasks/firewall.yml` (append the apply block)
- [ ] **Step 1: Append the safe-apply block to `roles/base/tasks/firewall.yml`**
Add at the end of the file:
```yaml
- name: Apply firewall ruleset safely (with auto-rollback)
  when:
    - base__firewall_apply | bool
    - base__firewall_render is changed
  tags: [firewall]
  block:
    # `nft list ruleset` output does not flush on load, so prefix it with
    # `flush ruleset`; otherwise the rollback would merge into the bad ruleset
    # (and an empty snapshot would roll back nothing).
    - name: Snapshot the current ruleset as the rollback point
      ansible.builtin.shell: "(echo 'flush ruleset'; nft list ruleset) > /etc/nftables.rollback"
      changed_when: false

    - name: Clear any stale rollback unit
      ansible.builtin.shell: >-
        systemctl stop nft-rollback.timer nft-rollback.service 2>/dev/null;
        systemctl reset-failed nft-rollback.timer nft-rollback.service 2>/dev/null;
        true
      changed_when: false

    - name: Arm the auto-rollback timer
      ansible.builtin.command:
        cmd: >-
          systemd-run --on-active={{ base__firewall_rollback_timeout }}
          --unit=nft-rollback /usr/sbin/nft -f /etc/nftables.rollback
      changed_when: true

    - name: Apply the new ruleset
      ansible.builtin.command: nft -f /etc/nftables.conf
      changed_when: true

    - name: Confirm connectivity survived, then disarm the rollback
      ansible.builtin.shell: >-
        systemctl stop nft-rollback.timer nft-rollback.service 2>/dev/null;
        systemctl reset-failed nft-rollback.timer nft-rollback.service 2>/dev/null;
        true
      changed_when: false

- name: Enable nftables.service so the ruleset persists across reboot
  ansible.builtin.systemd:
    name: nftables
    enabled: true
  when: base__firewall_apply | bool
  tags: [firewall]
```
(The "Confirm" step runs only if the play reached it — i.e. the apply did not sever the
connection. If the apply locked the host out, the play cannot continue, the armed timer
fires after `base__firewall_rollback_timeout` seconds, and the host self-heals to the
snapshot. Molecule sets `base__firewall_apply: false`, so this block is skipped there.)
- [ ] **Step 2: Re-run Molecule + lint (apply still skipped, must stay idempotent)**
Run: `make lint && make test ROLE=base`
Expected: `make lint` passes (no `no-changed-when`/FQCN findings — every command/shell has `changed_when`). Molecule still green and idempotent (the apply block is gated off by `base__firewall_apply: false`). DONE_WITH_CONCERNS if the image is unreachable.
- [ ] **Step 3: Commit**
```bash
git add roles/base/tasks/firewall.yml
git commit -m "feat(base): safe nftables apply with systemd-run auto-rollback"
```
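The arm, apply, confirm sequence in this task is a dead-man's switch: the revert is scheduled first, and only a controller that can still reach the host gets to cancel it. The sketch below illustrates that control flow in plain Python, with `threading.Timer` standing in for the `systemd-run` transient timer. It does not touch `nft` or systemd, and `RollbackGuard`, `apply_with_rollback`, and `still_reachable` are hypothetical names used for illustration only.

```python
import threading


class RollbackGuard:
    """Arm a revert before a risky change; cancel it only after a confirmation."""

    def __init__(self, revert, timeout_s):
        self._timer = threading.Timer(timeout_s, revert)
        self._timer.daemon = True

    def arm(self):
        self._timer.start()

    def confirm(self):
        # Reached only if the controller can still talk to the host.
        self._timer.cancel()

    def wait(self):
        # Block until the timer has either fired or been cancelled.
        self._timer.join()


def apply_with_rollback(apply, revert, still_reachable, timeout_s):
    """Run `apply`; `revert` fires after timeout_s unless reachability is confirmed."""
    guard = RollbackGuard(revert, timeout_s)
    guard.arm()          # 1. schedule the revert first
    apply()              # 2. make the risky change
    if still_reachable():
        guard.confirm()  # 3. disarm only when the change did not lock us out
    guard.wait()


if __name__ == "__main__":
    state = {"rules": "old"}
    apply_with_rollback(
        apply=lambda: state.update(rules="new"),
        revert=lambda: state.update(rules="old"),
        still_reachable=lambda: False,  # simulate a lockout
        timeout_s=0.2,
    )
    print(state["rules"])
```

In the role, the "still reachable" check is implicit: the disarm task can only run if the SSH connection survived the apply.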
---
### Task 6: Molecule verify — assert rendered rules + syntax
**Files:**
- Replace: `roles/base/molecule/default/verify.yml`
- [ ] **Step 1: Replace `roles/base/molecule/default/verify.yml`**
```yaml
---
- name: Verify
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Read the rendered ruleset
      ansible.builtin.slurp:
        src: /etc/nftables.conf
      register: ruleset

    - name: Decode it
      ansible.builtin.set_fact:
        nft: "{{ ruleset.content | b64decode }}"

    - name: Assert default-deny input policy and management plane
      ansible.builtin.assert:
        that:
          - "'type filter hook input priority 0; policy drop;' in nft"
          - "'ct state established,related accept' in nft"
          - "'iif \"wt0\" tcp dport 22 accept' in nft"
        fail_msg: "input chain is missing default-deny or the management plane"

    - name: Assert the lan->reverse_proxy:443 ingress rule
      ansible.builtin.assert:
        that:
          - "'10.30.0.0/24' in nft"
          - "'tcp dport 443 accept' in nft"
        fail_msg: "missing lan->443 rule for reverse_proxy"

    - name: Assert the reverse_proxy->photoprism:2342 ingress rule (resolved to host IP)
      ansible.builtin.assert:
        that:
          - "'10.20.0.50/32' in nft"
          - "'tcp dport 2342 accept' in nft"
        fail_msg: "missing reverse_proxy->2342 rule for photoprism"

    - name: Assert the docker_host extension hook is present
      ansible.builtin.assert:
        that:
          - "'include \"/etc/nftables.d/*.nft\"' in nft"
        fail_msg: "missing drop-in include hook"

    - name: Syntax-check the rendered ruleset (no apply)
      ansible.builtin.command: nft -c -f /etc/nftables.conf
      changed_when: false
```
- [ ] **Step 2: Run the full Molecule sequence + lint**
Run: `make lint && make test ROLE=base`
Expected: `make lint` passes; Molecule converge renders, then `verify.yml` passes all
assertions and the `nft -c` check. DONE_WITH_CONCERNS if the image is unreachable (note
that the assertions could not be exercised).
- [ ] **Step 3: Commit**
```bash
git add roles/base/molecule/default/verify.yml
git commit -m "test(base): molecule verify asserts rendered firewall rules + nft -c"
```
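The ingress assertions in `verify.yml` check the source CIDR and the `dport ... accept` text as two independent substrings, so they would still pass if the two landed on different lines. If a tighter check is ever wanted, the idea can be sketched in Python as a same-line match. This is an illustration only: `has_accept_rule` is a hypothetical helper, and it assumes the one-line `ip saddr { ... } <proto> dport <port> accept` shape that the template renders.

```python
def has_accept_rule(ruleset, source, proto, port):
    """True if a single line accepts proto/port from the given source CIDR."""
    wanted = f"}} {proto} dport {port} accept"
    for line in ruleset.splitlines():
        line = line.strip()
        if not line.startswith("ip saddr") or not line.endswith(wanted):
            continue
        if "{" not in line:
            continue
        # Pull the CIDRs out of the "{ a, b }" set on this line.
        inner = line[line.index("{") + 1 : line.index("}")]
        if source in [item.strip() for item in inner.split(",")]:
            return True
    return False


if __name__ == "__main__":
    rendered = """
        ip saddr { 10.30.0.0/24 } tcp dport 443 accept
        ip saddr { 10.20.0.50/32 } tcp dport 2342 accept
    """
    print(has_accept_rule(rendered, "10.30.0.0/24", "tcp", 443))
    print(has_accept_rule(rendered, "10.30.0.0/24", "tcp", 2342))
```

The substring assertions are adequate for the two-rule fixture in this plan; the same-line form matters more once several rules share a port or a source.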
---
### Task 7: Reflect the build in STATUS + CAPABILITIES
**Files:**
- Modify: `STATUS.md`
- Modify: `docs/CAPABILITIES.md`
- [ ] **Step 1: Update the `roles/base/` row in STATUS.md**
In `STATUS.md`, under "## Scaffolded but empty — NOT implemented", find the row:
```markdown
| `roles/base/` | Not in git — only an empty dir on disk (untracked). `site.yml` references it, so a clean clone errors on `make deploy PLAYBOOK=site` until it is built. |
```
Replace it with:
```markdown
| `roles/base/` | **Partially built.** The `firewall` concern is implemented (nftables: catalog-driven default-deny + east-west allowlist + auto-rollback apply; ADR-020) with pytest + Molecule render/syntax tests. Other concerns (SSH hardening, fail2ban, auditd, packages, users) are **not** built yet, so `make deploy PLAYBOOK=site` is still incomplete. |
```
- [ ] **Step 2: Update the firewall note in CAPABILITIES.md**
In `docs/CAPABILITIES.md` (§1 Edge & networking), find the line added for ADR-020:
```markdown
_Firewalling is two-layer (ADR-020): OPNsense at the perimeter + inter-VLAN, plus
per-host `nftables` (default-deny inbound + east-west allowlist) rendered by the `base`
role from a shared `group_vars` service catalog. Both layers are still to be built._
```
Replace the final sentence so it reads:
```markdown
_Firewalling is two-layer (ADR-020): OPNsense at the perimeter + inter-VLAN, plus
per-host `nftables` (default-deny inbound + east-west allowlist) rendered by the `base`
role from a shared `group_vars` service catalog. The host `nftables` layer is built (the
`base` firewall concern); the OPNsense layer is still to be built._
```
- [ ] **Step 3: Update the `_Last reviewed_` date in STATUS.md**
In `STATUS.md`, change the `_Last reviewed: ..._` line to `_Last reviewed: 2026-06-06._`
(if it is not already that date).
- [ ] **Step 4: Verify + lint**
Run: `grep -n "Partially built" STATUS.md && grep -n "host .nftables. layer is built" docs/CAPABILITIES.md && make lint`
Expected: both greps match; `make lint` passes.
- [ ] **Step 5: Commit**
```bash
git add STATUS.md docs/CAPABILITIES.md
git commit -m "docs: record base firewall concern built (ADR-020 host layer)"
```
---
## Final verification
- [ ] `make lint` passes end to end (yamllint + ansible-lint over `roles/base` + `check-tags: OK`).
- [ ] `.venv/bin/python -m pytest tests/ -v` passes (the `check-tags` suite + the 7 new `firewall_rules` tests).
- [ ] `make test ROLE=base` is green (or DONE_WITH_CONCERNS with a clear note if the Molecule image is unreachable in this environment).
- [ ] `git log --oneline -7` shows the seven task commits.
- [ ] Sanity: `roles/base/tasks/firewall.yml` never applies when `base__firewall_apply` is false, and every `command`/`shell` task has `changed_when` (ansible-lint clean).


@ -0,0 +1,480 @@
# Logging & Log Integrity Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Record the logging architecture (all logs → on-cluster Loki; a security subset also write-only off-site to `askari`) by authoring ADR-018 and reconciling every doc that touches logging/observability.
**Architecture:** Documentation-only. The runtime pieces — Alloy in the `base` role, the `loki`/`grafana` service roles, OPNsense syslog forwarding — wait on the `base` + service-role machinery STATUS.md lists as not-yet-built. This plan settles the decision and the doc reconciliation.
**Tech Stack:** Markdown. Verification is the repo's pre-commit hooks + a final cross-reference sweep. No markdown linter, so "tests" are hook-pass + grep checks.
---
## Pre-flight (read once)
- **`rbw` must be unlocked before every commit** (pre-commit ansible-lint decrypts `vault.yml`). `rbw unlocked`; if non-zero, stop and ask the user to `rbw unlock`.
- **Commit style:** one commit per task, imperative subject ≤72 chars.
- **Order:** Task 1 (ADR-018) first — later tasks link to it.
- **Spec:** `docs/superpowers/specs/2026-06-05-logging-log-integrity-design.md`.
- **Branch:** controller creates `chore/logging-log-integrity-docs` off `main` before Task 1; do not implement on `main`.
---
## File map
| File | Action | Responsibility |
|---|---|---|
| `docs/decisions/018-logging.md` | Create | Home of record for the logging architecture |
| `docs/decisions/002-security.md` | Modify | Make the "logs to central" + "active alerting" bullets concrete (→ ADR-018) |
| `docs/security/accepted-risks.md` | Modify | Add R4 — no cryptographic WORM for logs |
| `docs/CAPABILITIES.md` | Modify | Loki row → decided; add Alloy agent row; note security alerting |
| `docs/decisions/012-hardware-capacity.md` | Modify | Log-storage allocation + SSD-wearout tracked metric |
| `STATUS.md` | Modify | Rows: logging pipeline (designed, not built) |
| `docs/TODO.md` | Modify | Mark 3.1 decided; reconcile 3.6's "on askari" phrasing |
| `CLAUDE.md` | Modify | ADR-018 in Further reading |
**Deferred (not in this plan):** the Alloy task in `base`, the `loki`/`grafana` service roles, OPNsense Suricata syslog forwarding, the push-only `vault.loki.*` credential, and the live pipeline — all recorded in ADR-018/STATUS, built when the stack exists.
---
### Task 1: Author ADR-018 (the home of record)
**Files:**
- Create: `docs/decisions/018-logging.md`
- [ ] **Step 1: Create the ADR**
Create `docs/decisions/018-logging.md` with exactly this content (preserve em-dashes —, backticks, table pipes, `≠`, `~`):
```markdown
# ADR-018 — Logging and log integrity
## Context
boma wants all logs in one queryable store for troubleshooting, spotting issues over
time, and detecting intrusions / malicious activity. ADR-002 commits in principle
("logs shipped to a central location"; "active alerting wires AIDE/`auditd`/`fail2ban`/
Suricata… ties to the Loki/Grafana effort"); CAPABILITIES lists Loki and `askari` (the
off-site watchdog). Undecided: the architecture and the **integrity** question — an
attacker who roots a host will try to clear logs to cover their tracks.
The framing insight: the biggest anti-tampering win is that logs **leave the host in
near-real-time** — once a line is in a store the attacker doesn't control, wiping the
local copy is futile. How far to harden the central store is set by the threat model.
## Decision
1. **Threat model — opportunistic + blast-radius** (ADR-002 / accepted-risk R1). Not
forensic-grade.
2. **All logs → an on-cluster Loki** — the single monitoring DB for troubleshooting +
trends. Near-real-time shipping already defeats per-host track-covering.
3. **A security-relevant subset ALSO ships off-site to `askari`, write-only**:
tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock** — accepted-risk R4; append-only push + off-site is the
proportionate control.
5. **Disk-wear is a managed parameter** — media choice + bounded verbosity + tuned
retention + wearout monitoring.
## Architecture
- **Agent:** Grafana Alloy on every host, installed by the `base` role — reads journald
+ container logs + security sources (`auditd`, `authpriv`, `fail2ban`, AIDE).
- **Loki (cluster):** a `loki` service role on a docker_host; all logs; monolithic
single-binary mode; NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, in `offsite_hosts`; security subset
only, write-only, long retention, tiny volume.
- **Grafana (cluster):** both Lokis as datasources (one pane queries both); dashboards
+ the alerting ADR-002 calls for.
## Data flow & the security subset
Alloy writes everything to the cluster Loki and a filtered copy (a relabel/match stage
tags security sources `security="true"`) to the `askari` Loki. Subset: `auditd`,
`authpriv` (SSH/`sudo`), `fail2ban`, AIDE, **Suricata** (OPNsense isn't a `base` host —
it syslog-forwards its alerts to the ingest point), and key container security events.
**Write-only / append-only:** the `askari` push endpoint (`/loki/api/v1/push`) is
mesh-only with a **push-only credential**; query/admin/delete APIs are not exposed to
hosts. The push API has no edit/delete verb, so a compromised host can append but not
read/edit/delete. The cluster Loki uses the same push-only credential. Alloy buffers
(WAL) + retries across a brief outage.
## Security, integrity & residual risks
Defeats opportunistic track-covering (logs already off-host) and host-pivot-to-store
(append-only, off-cluster). The security trail survives full-cluster compromise.
Conscious residuals: append-only ≠ cryptographic WORM (root-on-`askari` could edit
chunks — R4); a few-seconds un-shipped window; agent compromise can stop *future*
shipping but not alter shipped history; **a host going silent is itself an alert**; a
stolen push credential appends noise but can't delete; an `askari` outage buffers +
flushes on reconnect.
## Retention & disk-wear
Estimates are intent-based until measured (like `/capacity-review`). Cluster Loki:
bounded hot retention (~30-90 days). `askari` subset: long (~1 year+, ~5-25 GB/yr).
Disk-wear rules: (1) log storage on NVMe/SSD or HDD, **never SD/USB flash**; (2) bounded
verbosity at source (sane levels, selective access logging, a targeted `auditd`
ruleset); (3) tuned Loki retention/compaction; (4) SSD **wearout/TBW** is a monitored
metric (Proxmox wearout %, `node_exporter` smartmon) with an alert. Log storage is a
tracked allocation in `docs/hardware/reference.md` (ADR-012).
## Status
Designed. **Authorable now:** this ADR + the ADR-002/CAPABILITIES/ADR-012/
accepted-risks/STATUS/TODO reconciliations. **Deferred on the stack:** Alloy-in-`base`,
the `loki`/`grafana` service roles, OPNsense syslog config, the push-only credential,
and the live pipeline.
## Dependencies
`base` role + service-role machinery (unbuilt, STATUS.md); the running cluster +
`askari` (`offsite_hosts`, ADR-016); OPNsense automation for Suricata syslog (ADR-007);
the metrics stack (Prometheus / `node_exporter`) for SSD-wearout + log-silence alerting
(sibling effort, TODO 3.6).
## What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose is disk-hungry on a small VPS; keep volume where storage is cheap and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail must be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice (one agent for logs, later metrics). |
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/018-logging.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/018-logging.md
git commit -m "Add ADR-018 (logging and log integrity)"
```
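ADR-018's data flow (every line to the cluster Loki, plus a filtered security copy to `askari`) reduces to a small routing decision per log entry. The sketch below models that decision in Python, purely as an illustration: it is not Alloy configuration, the `source` label key and the sink names `cluster` and `askari` are assumptions, and the "key container security events" part of the subset is left out because the ADR does not define it yet.

```python
# Sources the ADR lists as the security subset (container events omitted).
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}


def sinks_for(entry):
    """Return the sinks a log entry ships to, based on its `source` label."""
    sinks = ["cluster"]  # everything goes to the on-cluster Loki
    if entry.get("source") in SECURITY_SOURCES:
        sinks.append("askari")  # security subset also goes off-site, push-only
    return sinks


if __name__ == "__main__":
    for entry in ({"source": "auditd"}, {"source": "nginx"}, {}):
        print(entry, "->", sinks_for(entry))
```

Keeping the cluster sink unconditional reflects the ADR's point that the off-site copy is an addition, never a replacement.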
---
### Task 2: Make ADR-002's logging bullets concrete
**Files:**
- Modify: `docs/decisions/002-security.md`
Read the file first, then two exact edits.
- [ ] **Step 1: The audit-trail bullet**
Find:
```
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location if a log aggregation service is available
```
Replace with:
```
- `auditd` installed and running with a baseline ruleset
- Logs shipped to a central location in near-real-time — all logs to an on-cluster
Loki, plus a security-relevant subset write-only off-site to `askari` so the audit
trail survives host (and full-cluster) compromise (ADR-018)
```
- [ ] **Step 2: The active-alerting bullet**
Find:
```
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata into the
monitoring/alerting stack (planned; ties to the Loki/Grafana effort)
```
Replace with:
```
- **Active alerting** wires AIDE, `auditd`, `fail2ban`, and Suricata — plus
log-source-silence (a host that stops shipping) — into Grafana alerting on the
Loki/Grafana stack (ADR-018; planned)
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/002-security.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/002-security.md
git commit -m "ADR-002: make central-logging + alerting controls concrete (ADR-018)"
```
---
### Task 3: Add accepted-risk R4 (no WORM for logs)
**Files:**
- Modify: `docs/security/accepted-risks.md`
Read the file first, then one exact edit (add R4 after R3).
- [ ] **Step 1: Add the R4 row**
Find this exact line (the R3 row):
```
| R3 | **Self-hosted mesh control plane is a public target on `askari`** — the NetBird coordinator (ADR-016) exposes a management API + dashboard (TCP 80/443) and Coturn (UDP 3478) on `askari`'s public IP; the management API controls the whole mesh | Self-hosting means **no third-party trust** and an off-site control plane that survives a homelab outage (boma's sovereignty ethos). Residual surface is on `askari` (already a public VPS) and is mitigated: TLS + embedded-IdP login, source-IP restriction where practical, `base` hardening, version-pinned NetBird (ADR-011) patched on boma's cadence | A coordinator compromise or unpatched NetBird CVE; the management plane is reachable without auth/IP-limits; the operational burden makes a hosted coordinator worth reconsidering |
```
Add immediately **after** it:
```
| R4 | **No cryptographic WORM for logs** — shipped logs are append-only via Loki's push API and copied off-site to `askari` (ADR-018), but the stored chunks are not object-locked/immutable; a root-on-`askari` attacker could edit history | Append-only push + off-site copy already defeats the realistic threat: a host attacker covering their tracks cannot erase what has already shipped, and the security trail survives even full-cluster compromise. True WORM (object-lock) is forensic-grade cost for boma's opportunistic threat model (R1) | Threat model shifts toward targeted/forensic; a regulatory/evidentiary need appears; `askari` itself is assessed as a likely target |
```
- [ ] **Step 2: Bump the "Last reviewed" date**
Find:
```
_Last reviewed: 2026-06-05. The prior gaps
```
Replace with:
```
_Last reviewed: 2026-06-06. The prior gaps
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/security/accepted-risks.md`
Expected: Passed/Skipped.
```bash
git add docs/security/accepted-risks.md
git commit -m "accepted-risks: add R4 (no cryptographic WORM for logs)"
```
---
### Task 4: Update CAPABILITIES §3 (Observability)
**Files:**
- Modify: `docs/CAPABILITIES.md`
Read the file first, then two exact edits (the second one both adds the Alloy row and updates the Dashboards row).
- [ ] **Step 1: Loki row → decided, note the off-site sink**
Find:
```
| Logs | Loki | P | planned | Log aggregation | TODO 3.6 |
```
Replace with:
```
| Logs | Loki (cluster all-logs + off-site security subset on `askari`) | P | core | Central log aggregation; a security subset ships write-only off-site (append-only) | **Decided (ADR-018)** |
```
- [ ] **Step 2: Add the Alloy agent row** (right after the Loki row just edited) **and extend the Dashboards row**
Find:
```
| Dashboards | Grafana | P | planned | Visualisation + alerting | TODO 3.6 |
```
Replace with:
```
| Log shipping agent | Grafana Alloy (in `base`) | P | core | Collects journald + container + security logs on every host; ships to Loki (ADR-018) | **Decided (ADR-018)** |
| Dashboards | Grafana | P | planned | Visualisation + alerting (incl. AIDE/`auditd`/`fail2ban`/Suricata + log-silence — ADR-018) | TODO 3.6 |
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/CAPABILITIES.md`
Expected: Passed/Skipped.
```bash
git add docs/CAPABILITIES.md
git commit -m "CAPABILITIES: Loki decided + Alloy agent + security alerting (ADR-018)"
```
---
### Task 5: ADR-012 — log-storage allocation + wearout metric
**Files:**
- Modify: `docs/decisions/012-hardware-capacity.md`
Read the file first, then one exact edit (add a Consequences bullet).
- [ ] **Step 1: Add a Consequences bullet**
Find this exact block:
```
## Consequences
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
matching `capacity-scan.py` change.
```
Replace with:
```
## Consequences
- Right-sizing advice is intent-based until usage data exists; reports say so.
- `reference.md` table headers are a parser contract — changing them needs a
matching `capacity-scan.py` change.
- Log storage (ADR-018) is a tracked allocation: the cluster Loki host's retention
budget and `askari`'s security-subset volume belong in `reference.md`, and SSD
**wearout/TBW** is a monitored metric — logging is write-heavy, so wear is watched,
not assumed.
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/decisions/012-hardware-capacity.md`
Expected: Passed/Skipped.
```bash
git add docs/decisions/012-hardware-capacity.md
git commit -m "ADR-012: track log-storage allocation + SSD wearout (ADR-018)"
```
---
### Task 6: Add logging rows to STATUS.md
**Files:**
- Modify: `STATUS.md`
Read the file first, then one exact edit (add two rows after the Level 4 row).
- [ ] **Step 1: Add the rows**
Find this exact line:
```
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
```
Replace with that SAME line followed by the two new rows:
```
| Service-UI verification (Level 4) | ADR-017 / ADR-008 | **Design RESOLVED** (ADR-017 + spec + plan); resolves ADR-015 deferred #2. `/verify-service` skill + `VERIFY.md` template + standards are authorable and present. **Build pending:** running needs ubongo + `playwright` plugin + Authentik + a staging deploy. |
| Logging pipeline (Loki + Alloy + off-site subset) | ADR-018 | **Design RESOLVED** (ADR-018 + spec). All logs → on-cluster Loki; security subset write-only off-site to askari. **Build pending:** Alloy in `base`, `loki`/`grafana` service roles, OPNsense syslog — none built. |
| Security alerting (AIDE/auditd/fail2ban/Suricata + log-silence) | ADR-002 / ADR-018 | Wired into Grafana on the Loki stack. Designed; depends on the logging pipeline + metrics stack (TODO 3.6). |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files STATUS.md`
Expected: Passed/Skipped.
```bash
git add STATUS.md
git commit -m "STATUS: record logging pipeline + security alerting (ADR-018)"
```
---
### Task 7: Reconcile TODO 3.1 and 3.6
**Files:**
- Modify: `docs/TODO.md`
Read the file first, then two exact edits. (Preserve the `~~strikethrough~~` markers.)
- [ ] **Step 1: Mark 3.1 decided**
Find:
```
3. **Building services**
1. Decide how to manage logs.
```
Replace with:
```
3. **Building services**
1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs → on-cluster Loki via
Grafana Alloy (in `base`); a security subset also ships write-only off-site to
`askari` (append-only); Grafana queries both. WORM skipped (accepted-risk R4).
```
- [ ] **Step 2: Reconcile 3.6's "on askari" phrasing**
Find:
```
6. Wire up Loki, Prometheus, Grafana dashboards, Grafana alerts, and Uptime
Kuma alerts on askari.
```
Replace with:
```
6. Wire up the monitoring stack. Logging topology DECIDED (ADR-018): cluster Loki
(all logs) + off-site security subset on `askari` + Grafana on-cluster (not the
whole stack on `askari`). Still to design/build: Prometheus + metric exporters,
Uptime Kuma, and exactly which alerts live where.
```
- [ ] **Step 3: Verify and commit**
Run: `rbw unlocked && pre-commit run --files docs/TODO.md`
Expected: Passed/Skipped.
```bash
git add docs/TODO.md
git commit -m "TODO: mark log management decided (ADR-018); reconcile 3.6"
```
---
### Task 8: Link ADR-018 from CLAUDE.md
**Files:**
- Modify: `CLAUDE.md`
Read the file first, then one exact edit.
- [ ] **Step 1: Add the Further-reading row after Hardware & capacity**
Find:
```
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
```
Replace with that SAME line followed by the new row:
```
| Hardware & capacity | `docs/decisions/012-hardware-capacity.md` |
| Logging & log integrity | `docs/decisions/018-logging.md` |
```
- [ ] **Step 2: Verify and commit**
Run: `rbw unlocked && pre-commit run --files CLAUDE.md`
Expected: Passed/Skipped.
```bash
git add CLAUDE.md
git commit -m "CLAUDE.md: link ADR-018 (logging)"
```
---
### Task 9: Final consistency sweep
**Files:** none modified (verification only)
- [ ] **Step 1: ADR-018 present + cross-linked (canonical docs only)**
Run:
```bash
test -f docs/decisions/018-logging.md && echo "ADR-018 present"
grep -rl "ADR-018\|018-logging" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: the file exists and the referencing docs appear — ADR-002, accepted-risks, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md.
- [ ] **Step 2: No stale "logging undecided / if available" language**
Run:
```bash
grep -rniE "log aggregation service is available|Logs \| Loki \| P \| planned|Decide how to manage logs\.($|[^~])" docs/ CLAUDE.md STATUS.md | grep -vE "superpowers/(plans|specs)/"
```
Expected: no hits — the ADR-002 conditional, the "planned" Loki row, and the open "Decide how to manage logs" TODO are all now updated.
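The third alternative ends in `($|[^~])` so that it matches the still-open TODO sentence but not the struck-through one, where the full stop is immediately followed by `~~`. A small sketch of the same alternation, assuming Python's `re` as a stand-in for `grep -E -i`; the sample lines are copied from Tasks 4 and 7 and the name `STALE` is illustrative only:

```python
import re

# Same three alternatives as the grep pattern; IGNORECASE mirrors grep's -i.
STALE = re.compile(
    r"log aggregation service is available"
    r"|Logs \| Loki \| P \| planned"
    r"|Decide how to manage logs\.($|[^~])",
    re.IGNORECASE,
)

open_item = "1. Decide how to manage logs."
struck_item = "1. ~~Decide how to manage logs.~~ DECIDED (ADR-018): all logs"
old_loki_row = "| Logs | Loki | P | planned | Log aggregation | TODO 3.6 |"

assert STALE.search(open_item)        # still open: flagged as stale
assert STALE.search(old_loki_row)     # un-updated CAPABILITIES row: flagged
assert not STALE.search(struck_item)  # struck through: "." is followed by "~"
```

If any of the three `assert` lines failed, the sweep would either miss stale text or flag the reconciled TODO line.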
- [ ] **Step 3: Full hook run**
Run: `rbw unlocked && pre-commit run --all-files`
Expected: all hooks Passed/Skipped. Fix anything that fails (likely trailing whitespace / end-of-file) and amend the owning commit.
- [ ] **Step 4: Push (only if the user asks)**
```bash
git push origin <branch-or-main-after-merge>
```
---
## Self-review notes (author)
- **Spec coverage:** decision/architecture/data-flow/security/retention → Task 1 (ADR-018); the spec's "Documentation & implementation changes" table → Tasks 2–8 (ADR-002, accepted-risks R4, CAPABILITIES, ADR-012, STATUS, TODO, CLAUDE.md). The role/pipeline rows in that table are deferred (recorded in ADR-018/STATUS), not implemented here. ✓
- **Deferred, intentional:** Alloy-in-`base`, the `loki`/`grafana` service roles, OPNsense syslog forwarding, the `vault.loki.*` credential, the metrics-stack dependency — all need the unbuilt machinery; named in ADR-018/STATUS. ✓
- **No placeholders:** every create/edit shows exact text. ✓
- **Name consistency:** `ADR-018` / `018-logging.md`, "security subset", `offsite_hosts`, Grafana Alloy, push-only credential, R4 used identically across tasks. ✓
```


@ -0,0 +1,728 @@
# Ansible Tagging Standard Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Establish a two-tier Ansible tagging standard (role-name tags + a closed concern list) with machine-enforced vocabulary, plus a Proxmox VM metadata-tag convention, so playbook runs are targeted, transparent, and predictable.
**Architecture:** A single source-of-truth YAML (`tests/tags.yml`) lists the allowed concern/special/opt-in/playbook tags. A Python checker (`scripts/check-tags.py`) scans `roles/` and `playbooks/`, computes the allowed set as `{role dir names} ∪ {tags.yml entries}`, and fails `make lint` on any unknown tag. Terraform gets a documented three-tag VM convention (metadata only). The standard is recorded as ADR-019 and folded into CLAUDE.md.
**Tech Stack:** Python 3 (stdlib + PyYAML, already present via ansible-core), pytest (already in `requirements.txt`), Make, Terraform (HCL edit only — not `init`ed), Markdown docs.
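The allowed-set rule in the Architecture paragraph can be illustrated with a toy example before any files exist. This is a minimal sketch using hypothetical in-memory data; the names `allowed_tags` and `unknown_tags` are illustrative and are not the functions defined in Tasks 2 and 3:

```python
# Hypothetical stand-ins: the real checker reads roles/ and tests/tags.yml.
role_dirs = {"base", "docker_host"}
vocabulary = {
    "concerns": ["firewall", "config", "deploy"],
    "special": ["always", "never"],
    "opt_ins": [],
    "playbooks": ["bootstrap"],
}


def allowed_tags(role_dirs, vocabulary):
    """Allowed set = role directory names plus every vocabulary entry."""
    allowed = set(role_dirs)
    for entries in vocabulary.values():
        allowed.update(entries or [])
    return allowed


def unknown_tags(used, allowed):
    """Tags in use that fall outside the allowed set, sorted for stable output."""
    return sorted(set(used) - allowed)


allowed = allowed_tags(role_dirs, vocabulary)
print(unknown_tags({"base", "docker", "firewall"}, allowed))  # ['docker']
```

Here `docker` is rejected because it is neither a role directory name nor a vocabulary entry, which is the same situation Task 4 finds in `playbooks/site.yml`.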
---
## File structure
| File | Responsibility | Action |
|------|----------------|--------|
| `tests/tags.yml` | Single source of truth: allowed concern/special/opt-in/playbook tags | Create |
| `scripts/check-tags.py` | Scan `roles/`+`playbooks/`, fail on tags outside the allowed set | Create |
| `tests/test_check_tags.py` | Unit tests for the checker (mirrors `tests/test_capacity_scan.py`) | Create |
| `Makefile` | Wire `check-tags.py` into the `lint` target | Modify |
| `playbooks/site.yml` | Fix `docker_host` role tag (`docker` → `docker_host`) | Modify |
| `docs/decisions/019-tagging.md` | The ADR (the standard itself) | Create |
| `CLAUDE.md` | Reword tag rule; add Proxmox tag convention; add ADR-019 to Further reading | Modify |
| `terraform/environments/staging/main.tf` | Add `managed-by=terraform` tag | Modify |
| `terraform/environments/production/main.tf` | Add `managed-by=terraform` tag | Modify |
| `docs/TODO.md` | Mark 3.7 and 3.11 DECIDED | Modify |
| `docs/CAPABILITIES.md` | Note targeted runs as a capability | Modify |
Notes for the implementer:
- The repo venv is `.venv`. Run Python as `.venv/bin/python` (Makefile vars: `PYTHON := .venv/bin/python`). If `.venv` is missing, run `make setup` first.
- PyYAML is available in the venv (ansible-core depends on it) — `import yaml` works.
- Terraform is **not** `init`ed in this repo, so `terraform validate`/`plan` will fail offline. Only use `terraform fmt` (offline-safe) for the HCL tasks.
- Before any `git commit`, the pre-commit hook decrypts `vault.yml`, so the vault agent must be unlocked: run `rbw unlocked` (exit 0 = good). If locked, ask the user to `rbw unlock` and wait. None of these tasks touch vault files, but the hook still runs.
---
### Task 1: Tag vocabulary file (`tests/tags.yml`)
**Files:**
- Create: `tests/tags.yml`
- [ ] **Step 1: Create the vocabulary file**
Create `tests/tags.yml` with exactly this content:
```yaml
---
# Allowed Ansible tag vocabulary — single source of truth for scripts/check-tags.py.
# Authoritative reference & rationale: docs/decisions/019-tagging.md.
#
# The full allowed set the linter enforces is:
#   {role directory names under roles/} ∪ everything listed below.
#
# To add a CONCERN tag: add it here AND add a row to the ADR-019 table with a
# one-line justification (cross-cutting, used in 2+ roles, distinct).
# Cross-cutting concern tags, applied per-task/block where a task belongs to the
# concern. Targeted one at a time (tags are union/OR, never intersected).
concerns:
- packages # apt package install/management
- users # accounts, groups, sudo
- firewall # nftables rulesets & port definitions (ADR-002)
- hardening # security baseline — sshd config, fail2ban, auditd, sysctl
- logging # Alloy / log-shipping config (ADR-018)
- monitoring # metric exporters / health checks
- config # render templated config/compose files to disk — no restart
- deploy # bring services up / restart (compose up -d)
- proxy # reverse-proxy + TLS registration (Traefik routes, Authentik)
# Ansible built-in special tags. Narrow use only:
# always — cheap preflight assertions (run regardless of --tags)
# never — destructive/expensive tasks, paired with an opt-in tag below
special:
- always
- never
# `never`-paired opt-in tags: destructive/expensive tasks that only run when
# named explicitly (e.g. `tags: [never, force_pull]`). Empty until a role adds one.
opt_ins: []
# Playbook-level identity tags for role-less lifecycle plays (e.g. bootstrap.yml).
playbooks:
- bootstrap
```
- [ ] **Step 2: Verify it parses and has the expected shape**
Run:
```bash
.venv/bin/python -c "import yaml; d=yaml.safe_load(open('tests/tags.yml')); assert len(d['concerns'])==9, d['concerns']; assert d['special']==['always','never']; assert d['opt_ins']==[]; assert d['playbooks']==['bootstrap']; print('tags.yml OK')"
```
Expected: prints `tags.yml OK` and exits 0.
- [ ] **Step 3: Commit**
```bash
git add tests/tags.yml
git commit -m "feat(tags): add allowed-tag vocabulary (tests/tags.yml)"
```
---
### Task 2: Checker core — tag collection & allowed-set helpers
**Files:**
- Create: `scripts/check-tags.py`
- Test: `tests/test_check_tags.py`
- [ ] **Step 1: Write the failing tests**
Create `tests/test_check_tags.py`:
```python
import importlib.util
import pathlib

# check-tags.py has a hyphen in its name, so load it by path instead of `import`.
_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "check-tags.py"
_spec = importlib.util.spec_from_file_location("check_tags", _PATH)
ct = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(ct)


def test_collect_tags_list_form():
    node = {"name": "t", "tags": ["firewall", "users"]}
    assert ct.collect_tags(node) == {"firewall", "users"}


def test_collect_tags_string_form():
    node = {"name": "t", "tags": "always"}
    assert ct.collect_tags(node) == {"always"}


def test_collect_tags_nested_blocks_and_roles():
    doc = [
        {"hosts": "all", "roles": [{"role": "base", "tags": ["base"]}]},
        {"block": [{"name": "x", "tags": ["config"]}], "tags": ["deploy"]},
    ]
    assert ct.collect_tags(doc) == {"base", "config", "deploy"}


def test_collect_tags_ignores_templated_values():
    node = {"tags": ["{{ dynamic }}", "logging"]}
    assert ct.collect_tags(node) == {"logging"}


def test_load_vocab_unions_all_categories():
    vocab = ct.load_vocab()
    assert "firewall" in vocab  # concern
    assert "always" in vocab  # special
    assert "bootstrap" in vocab  # playbook identity
    assert len(vocab) >= 12


def test_role_names_reads_role_dirs():
    names = ct.role_names()
    assert "base" in names
    assert "docker_host" in names
```
- [ ] **Step 2: Run tests to verify they fail**
Run: `.venv/bin/python -m pytest tests/test_check_tags.py -v`
Expected: FAIL at collection with `FileNotFoundError` for `scripts/check-tags.py` (the script does not exist yet, so the module can't be loaded).
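The test file loads the checker by path because a hyphenated filename cannot be used in an `import` statement. The sketch below shows the same `importlib.util` pattern against a throwaway file, including what happens while the file is still missing. The file name `demo-script.py` and the helper `load_script` are hypothetical, chosen only for illustration:

```python
import importlib.util
import pathlib
import tempfile


def load_script(path, name):
    """Load a Python file as a module even if its filename is not importable."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # reads the file; fails if it is absent
    return module


with tempfile.TemporaryDirectory() as tmp:
    script = pathlib.Path(tmp) / "demo-script.py"  # hypothetical hyphenated name

    try:
        load_script(script, "demo_script")
    except FileNotFoundError:
        missing = True  # the state this step expects before the script exists
    else:
        missing = False

    script.write_text("def answer():\n    return 42\n")
    demo = load_script(script, "demo_script")
    result = demo.answer()
```

The spec is created without touching the filesystem, so the error only surfaces at `exec_module`, which is why pytest reports it while collecting the test file rather than inside a test.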
- [ ] **Step 3: Write the minimal implementation**
Create `scripts/check-tags.py`:
```python
#!/usr/bin/env python3
"""
Validate that every Ansible tag used under roles/ and playbooks/ belongs to the
approved vocabulary. Single source of truth: tests/tags.yml. Rationale: ADR-019.

Allowed set = {role directory names under roles/} ∪ {concerns, special, opt_ins,
playbooks from tests/tags.yml}. Templated tags (containing "{{") are skipped;
they can't be statically validated.

Usage: python3 scripts/check-tags.py
Exit 0 = all tags allowed; exit 1 = unknown tag(s) found.
"""
import pathlib
import sys

import yaml

REPO = pathlib.Path(__file__).resolve().parent.parent
VOCAB_FILE = REPO / "tests" / "tags.yml"
SCAN_DIRS = ("roles", "playbooks")


class _IgnoreUnknownTags(yaml.SafeLoader):
    """SafeLoader that tolerates custom YAML tags (e.g. !vault) instead of crashing."""


def _ignore(loader, tag_suffix, node):
    # Any value carrying a custom YAML tag is loaded as None.
    return None


_IgnoreUnknownTags.add_multi_constructor("", _ignore)
_IgnoreUnknownTags.add_multi_constructor("!", _ignore)


def _static_str(value):
    return isinstance(value, str) and "{{" not in value


def load_vocab(path=VOCAB_FILE):
    data = yaml.safe_load(path.read_text()) or {}
    vocab = set()
    for key in ("concerns", "special", "opt_ins", "playbooks"):
        vocab.update(data.get(key) or [])
    return vocab


def role_names(repo=REPO):
    roles_dir = repo / "roles"
    if not roles_dir.is_dir():
        return set()
    return {p.name for p in roles_dir.iterdir() if p.is_dir()}


def collect_tags(node):
    """Recursively collect every static tag string under any 'tags:' key."""
    tags = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "tags":
                if _static_str(value):
                    tags.add(value)
                elif isinstance(value, list):
                    tags.update(t for t in value if _static_str(t))
            # Recurse into every value so nested roles/blocks are covered.
            tags |= collect_tags(value)
    elif isinstance(node, list):
        for item in node:
            tags |= collect_tags(item)
    return tags


if __name__ == "__main__":  # pragma: no cover
    sys.exit(0)
```
- [ ] **Step 4: Run tests to verify they pass**
Run: `.venv/bin/python -m pytest tests/test_check_tags.py -v`
Expected: PASS (all 6 tests).
- [ ] **Step 5: Commit**
```bash
git add scripts/check-tags.py tests/test_check_tags.py
git commit -m "feat(tags): checker helpers — tag collection & allowed-set"
```
---
### Task 3: Checker validation — scan files and fail on unknown tags
**Files:**
- Modify: `scripts/check-tags.py`
- Test: `tests/test_check_tags.py`
- [ ] **Step 1: Write the failing tests**
Append to `tests/test_check_tags.py`:
```python
def test_scan_text_collects_from_yaml_string():
    text = """
- hosts: all
  roles:
    - role: base
      tags: [base]
  tasks:
    - name: open port
      tags: [firewall]
"""
    assert ct.scan_text(text) == {"base", "firewall"}


def test_scan_text_tolerates_custom_yaml_tags():
    text = "- name: t\n  secret: !vault xxx\n  tags: [users]\n"
    assert ct.scan_text(text) == {"users"}


def test_find_violations_flags_unknown_tag():
    allowed = {"base", "firewall"}
    used = {"base", "frewall"}  # typo
    assert ct.find_violations(used, allowed) == ["frewall"]


def test_find_violations_empty_when_all_allowed():
    assert ct.find_violations({"base", "firewall"}, {"base", "firewall"}) == []
```
- [ ] **Step 2: Run tests to verify they fail**
Run: `.venv/bin/python -m pytest tests/test_check_tags.py -v`
Expected: FAIL — `AttributeError: module 'check_tags' has no attribute 'scan_text'` (and `find_violations`).
- [ ] **Step 3: Add the scanning + validation functions**
In `scripts/check-tags.py`, replace the final block:
```python
if __name__ == "__main__":  # pragma: no cover
    sys.exit(0)
```
with:
```python
def scan_text(text):
    """Collect static tags from a (possibly multi-document) YAML string."""
    found = set()
    for doc in yaml.load_all(text, Loader=_IgnoreUnknownTags):
        found |= collect_tags(doc)
    return found


def iter_yaml_files(repo=REPO, scan_dirs=SCAN_DIRS):
    for name in scan_dirs:
        base = repo / name
        if not base.is_dir():
            continue
        for ext in ("*.yml", "*.yaml"):
            yield from sorted(base.rglob(ext))


def find_violations(used, allowed):
    return sorted(used - allowed)


def main():
    allowed = load_vocab() | role_names()
    violations = []
    for path in iter_yaml_files():
        try:
            used = scan_text(path.read_text())
        except yaml.YAMLError as exc:
            print(f"warning: could not parse {path}: {exc}", file=sys.stderr)
            continue
        for tag in find_violations(used, allowed):
            violations.append((path.relative_to(REPO), tag))
    if violations:
        print(
            "error: Ansible tag(s) not in tests/tags.yml or role names "
            "(see docs/decisions/019-tagging.md):",
            file=sys.stderr,
        )
        for relpath, tag in violations:
            print(f"  {relpath}: '{tag}'", file=sys.stderr)
        print(f"\nallowed: {', '.join(sorted(allowed))}", file=sys.stderr)
        sys.exit(1)
    print(f"check-tags: OK ({len(allowed)} tags allowed across {len(SCAN_DIRS)} dirs)")


if __name__ == "__main__":
    main()
```
- [ ] **Step 4: Run tests to verify they pass**
Run: `.venv/bin/python -m pytest tests/test_check_tags.py -v`
Expected: PASS (all 10 tests).
- [ ] **Step 5: Commit**
```bash
git add scripts/check-tags.py tests/test_check_tags.py
git commit -m "feat(tags): scan roles/+playbooks/ and fail on unknown tags"
```
---
### Task 4: Reconcile existing tags & wire into `make lint`
**Files:**
- Modify: `playbooks/site.yml:18-19`
- Modify: `Makefile` (the `lint:` target)
- [ ] **Step 1: Run the checker against the current repo (expect one violation)**
Run: `.venv/bin/python scripts/check-tags.py`
Expected: FAIL (exit 1) reporting `playbooks/site.yml: 'docker'` — because the `docker_host` role is tagged `[docker]`, which is neither a role name nor a vocabulary tag. This confirms the checker works end-to-end.
- [ ] **Step 2: Fix the role tag to equal the role name**
In `playbooks/site.yml`, change:
```yaml
- role: docker_host
  tags: [docker]
```
to:
```yaml
- role: docker_host
  tags: [docker_host]
```
- [ ] **Step 3: Re-run the checker (expect clean)**
Run: `.venv/bin/python scripts/check-tags.py`
Expected: PASS — prints `check-tags: OK (... tags allowed across 2 dirs)` and exits 0.
(Allowed set now includes role names `base`, `docker_host`; used tags are `base`, `docker_host`, `bootstrap` — all allowed.)
- [ ] **Step 4: Wire the checker into `make lint`**
In `Makefile`, change the `lint:` target from:
```makefile
lint:
	$(VENV)/bin/yamllint .
	$(LINT)
```
to:
```makefile
lint:
	$(VENV)/bin/yamllint .
	$(LINT)
	$(PYTHON) scripts/check-tags.py
```
- [ ] **Step 5: Run the full lint suite and the test suite**
Run: `make lint && .venv/bin/python -m pytest tests/test_check_tags.py -v`
Expected: yamllint passes, ansible-lint passes, `check-tags: OK`, and all pytest tests PASS.
- [ ] **Step 6: Commit**
```bash
git add playbooks/site.yml Makefile
git commit -m "feat(tags): enforce tag vocabulary in make lint; fix docker_host tag"
```
---
### Task 5: Terraform Proxmox VM tag convention
**Files:**
- Modify: `terraform/environments/staging/main.tf` (the `tags =` line in `module "vms"`)
- Modify: `terraform/environments/production/main.tf` (the `tags =` line in `module "vms"`)
- [ ] **Step 1: Add `managed-by=terraform` to the staging VM tags**
In `terraform/environments/staging/main.tf`, change:
```hcl
tags = ["staging", each.value.group]
```
to:
```hcl
tags = ["staging", each.value.group, "managed-by=terraform"]
```
- [ ] **Step 2: Add `managed-by=terraform` to the production VM tags**
In `terraform/environments/production/main.tf`, change:
```hcl
tags = ["production", each.value.group]
```
to:
```hcl
tags = ["production", each.value.group, "managed-by=terraform"]
```
- [ ] **Step 3: Format-check the HCL (offline-safe)**
Run: `terraform -chdir=terraform/environments/staging fmt && terraform -chdir=terraform/environments/production fmt`
Expected: either no output (already formatted) or the filename printed (reformatted). Exit 0.
(Do NOT run `terraform validate`/`plan` — Terraform is not `init`ed in this repo and they will fail offline.)
- [ ] **Step 4: Confirm the edits**
Run: `grep -n "managed-by=terraform" terraform/environments/staging/main.tf terraform/environments/production/main.tf`
Expected: one match in each file.
- [ ] **Step 5: Commit**
```bash
git add terraform/environments/staging/main.tf terraform/environments/production/main.tf
git commit -m "feat(tags): Proxmox VM metadata convention (managed-by=terraform)"
```
---
### Task 6: Documentation — ADR-019, CLAUDE.md, TODO, CAPABILITIES
**Files:**
- Create: `docs/decisions/019-tagging.md`
- Modify: `CLAUDE.md` (Ansible conventions; Terraform conventions; Further reading)
- Modify: `docs/TODO.md` (items 3.7 and 3.11)
- Modify: `docs/CAPABILITIES.md`
- [ ] **Step 1: Write the ADR**
Create `docs/decisions/019-tagging.md`:
````markdown
# ADR-019 — Tagging standard for targeted, predictable runs
## Status
Accepted (2026-06-06). Resolves TODO 3.7 ("Define a tagging standard that lets us
target runs without over-tagging") and TODO 3.11 ("Deliberate tagging strategy").
## Context
boma wants to run playbooks **targeted** — a single service, a single layer, or a
single cross-cutting concern — **transparently and predictably**: a reader should
know from a `--tags` invocation exactly what it will and won't touch. CLAUDE.md
already requires tag-filterable tasks, but no vocabulary or convention existed, and
the TODO explicitly warns against the opposite failure mode: **over-tagging**.
## Decision
### Two-tier tagging
**Tier 1 — role/service tag (mechanical).** The tag equals the role name, applied
once at the role-import level:
```yaml
roles:
  - role: photoprism
    tags: [photoprism]
```
Ansible propagates it to every task in the role. Because one service = one role
(ADR-004), this single rule covers both the *layer/role* and *single-service*
targeting axes with zero per-task burden. Role-less lifecycle playbooks
(e.g. `bootstrap.yml`) carry a single playbook-identity tag instead.
**Tier 2 — concern tag (curated).** A small **closed list** of cross-cutting concern
tags, applied per-task/block **only where a task genuinely belongs to that concern**.
### The closed concern list
A concern earns a tag only if it (a) appears in 2+ roles, (b) is worth running as a
slice on its own, and (c) doesn't overlap confusingly with another.
| Tag | Covers |
|-----|--------|
| `packages` | apt package install/management |
| `users` | accounts, groups, sudo |
| `firewall` | nftables rulesets & port definitions (ADR-002) |
| `hardening` | security baseline — sshd config, fail2ban, auditd, sysctl |
| `logging` | Alloy / log-shipping config (ADR-018) |
| `monitoring` | metric exporters / health checks |
| `config` | render templated config/compose files to disk — **no restart** |
| `deploy` | bring services up / restart (`compose up -d`) |
| `proxy` | reverse-proxy + TLS registration (Traefik routes, Authentik) |
The `config`/`deploy` split lets you re-render and diff configuration (`--tags
config`) without bouncing services, then restart deliberately (`--tags deploy`).
`backup` and `secrets` are intentionally omitted until the roles needing them exist.
### `always` / `never`
- **`always`** — reserved for cheap preflight assertions (vault unlocked, OS is
Debian 13, required vars present), so even `--tags config` runs its safety guards.
- **`never`** — reserved for destructive/expensive opt-in tasks, each paired with a
descriptive tag (e.g. `tags: [never, force_pull]`); they run only when named.
### Predictability principle: tags are union-only
`--tags a,b` runs tasks tagged a **OR** b — Ansible has no native AND. boma therefore
targets **one axis at a time**: either a role/service *or* a concern, never an
intersection like "photoprism's firewall only." If that's ever needed, just run
`--tags photoprism` (idempotent and fast). Designing for intersection is the
over-tagging trap; we decline it on purpose.
### Terraform / Proxmox VM tags (metadata only)
Every Terraform-managed VM carries exactly three Proxmox tags:
| Tag | Value | Purpose |
|-----|-------|---------|
| env | `staging` \| `production` | which environment |
| role/group | `docker_hosts`, `proxmox_hosts`, … | matches the inventory group |
| managed-by | `terraform` | distinguishes IaC VMs from hand-made ones |
These are **pure metadata for transparency** (glanceable in the Proxmox UI). They do
**not** drive run-targeting and do **not** feed inventory — `scripts/tf_to_inventory.py`
keeps building groups from the `group` output field, the single source of truth.
## Enforcement
`tests/tags.yml` is the single source of truth for the allowed concern/special/
opt-in/playbook tags. `scripts/check-tags.py` (run by `make lint`, covered by
`tests/test_check_tags.py`) scans `roles/` and `playbooks/` and fails on any tag
outside `{role directory names} ∪ {tests/tags.yml entries}`.
## Extending the vocabulary
To add a concern tag: (1) add it to `tests/tags.yml`; (2) add a row to the concern
table above with a one-line justification showing it passes the litmus test
(cross-cutting, 2+ roles, distinct). That is the whole gate — lightweight, but it
leaves a paper trail.
## Consequences
- Targeted runs are predictable: only two kinds of tags exist, one of them mechanical.
- Over-tagging is structurally resisted (closed list + lint enforcement).
- Intersection targeting is unavailable by design.
- Authors must keep role tags = role names; the linter enforces it.
## Related
ADR-002 (security baseline / firewall), ADR-004 (one service = one role),
ADR-009 (TF↔Ansible handoff / inventory), ADR-018 (logging).
````
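For the implementer, the ADR's union-only principle and its `always`/`never` rules can be pictured with a small model. This is a deliberately simplified sketch, not Ansible's implementation: it ignores `--skip-tags` and the `all`/`tagged`/`untagged` keywords, the task names are hypothetical, and the `force_pull` task is modeled without a role tag.

```python
def selected(task_tags, requested=None):
    """Would a task run? `requested` is the --tags set, or None if --tags is absent."""
    tags = set(task_tags)
    if "always" in tags:
        return True  # preflight guards run on every invocation
    if requested is None:
        return "never" not in tags  # plain run: everything except opt-in tasks
    # Union/OR: any overlap selects the task; there is no way to require all tags.
    return not tags.isdisjoint(requested)


# Hypothetical tasks with their effective tags.
tasks = {
    "assert vault unlocked": ["always"],
    "photoprism: render compose": ["photoprism", "config"],
    "photoprism: open port": ["photoprism", "firewall"],
    "base: nftables ruleset": ["base", "firewall"],
    "force image pull": ["never", "force_pull"],
}


def run_list(requested=None):
    return [name for name, tags in tasks.items() if selected(tags, requested)]


firewall_only = run_list({"firewall"})
both = run_list({"photoprism", "firewall"})
opt_in = run_list({"force_pull"})
plain = run_list()
```

In this model `both` contains four tasks, including `base: nftables ruleset` and `photoprism: render compose`, so asking for `photoprism,firewall` widens the run rather than narrowing it to "photoprism's firewall only".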
- [ ] **Step 2: Reword the tag rule in CLAUDE.md**
In `CLAUDE.md`, under **Ansible conventions**, change:
```markdown
- **Tags**: every task must have at least one tag; playbooks support `--tags` filtering
```
to:
```markdown
- **Tags** (ADR-019): import each role with its role-name tag once at the play level
(Ansible inherits it to every task). Tag a task/block with a concern tag from the
approved list (`tests/tags.yml`) only where it genuinely belongs to that concern —
don't invent tags or tag for tagging's sake. Target one axis at a time (role/service
*or* concern; tags are union/OR, never intersected). `make lint` enforces the vocabulary.
```
- [ ] **Step 3: Add the Proxmox tag convention to CLAUDE.md**
In `CLAUDE.md`, under **Terraform conventions**, add this bullet after the existing
"Terraform owns VM existence only" bullet:
```markdown
- Every TF-managed VM carries three Proxmox tags — `<env>`, its inventory `group`, and
`managed-by=terraform` — as **metadata only** (ADR-019). They do not feed inventory
or run-targeting; `tf_to_inventory.py` still groups by the `group` output field.
```
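The "metadata only" point can be sketched as two independent functions: one derives the three Proxmox tags, the other builds inventory groups from the `group` field without ever reading the tags. The VM records and both function names below are hypothetical; this is not the code in `tf_to_inventory.py`, only an illustration of the separation it is described as keeping.

```python
# Hypothetical VM records; field names are assumptions for illustration.
vms = [
    {"name": "docker01", "env": "staging", "group": "docker_hosts"},
    {"name": "docker02", "env": "staging", "group": "docker_hosts"},
]


def proxmox_tags(vm):
    """The three metadata tags: environment, inventory group, managed-by marker."""
    return [vm["env"], vm["group"], "managed-by=terraform"]


def inventory_groups(vms):
    """Group hosts by the `group` field only; tags are never consulted."""
    groups = {}
    for vm in vms:
        groups.setdefault(vm["group"], []).append(vm["name"])
    return groups


tags = proxmox_tags(vms[0])
groups = inventory_groups(vms)
```

Because `inventory_groups` never looks at `tags`, renaming or dropping a Proxmox tag cannot change which hosts a play targets.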
- [ ] **Step 4: Add ADR-019 to the Further reading table**
In `CLAUDE.md`, in the **Further reading** table, add this row immediately after the
`Logging & log integrity` row:
```markdown
| Tagging & run-targeting | `docs/decisions/019-tagging.md` |
```
- [ ] **Step 5: Mark the TODO items decided**
In `docs/TODO.md`, change the line for item 3.7:
```markdown
7. Define a tagging standard that lets us target runs without over-tagging.
```
to:
```markdown
7. ~~Define a tagging standard that lets us target runs without over-tagging.~~
DECIDED (ADR-019): two-tier — role-name tags (auto, at play level) + a closed
9-tag concern list (`tests/tags.yml`); union-only targeting; enforced by `make lint`.
```
and change item 3.11:
```markdown
11. Deliberate tagging strategy.
```
to:
```markdown
11. ~~Deliberate tagging strategy.~~ DECIDED (ADR-019) — folded into 3.7.
```
- [ ] **Step 6: Note the capability in CAPABILITIES.md**
Run: `grep -n "^## \|^### " docs/CAPABILITIES.md` to locate the section covering
operations / CI / how playbooks are run. Add this bullet under the most appropriate
existing section (operations or testing/CI):
```markdown
- **Targeted runs** (ADR-019): playbooks are sliced with `--tags` along two axes —
role/service (tag = role name) or a closed list of cross-cutting concerns
(`firewall`, `logging`, `config`, `deploy`, …); the vocabulary is lint-enforced.
```
- [ ] **Step 7: Verify docs are consistent and lint still passes**
Run:
```bash
grep -n "019-tagging" CLAUDE.md && grep -c "managed-by=terraform" CLAUDE.md && make lint
```
Expected: the ADR-019 row is found in CLAUDE.md, `managed-by=terraform` appears at
least once, and `make lint` passes (including `check-tags: OK`).
- [ ] **Step 8: Commit**
```bash
git add docs/decisions/019-tagging.md CLAUDE.md docs/TODO.md docs/CAPABILITIES.md
git commit -m "docs(tags): ADR-019 + CLAUDE.md/TODO/CAPABILITIES (tagging standard)"
```
---
## Final verification
- [ ] Run the full suite once more: `make lint && .venv/bin/python -m pytest tests/ -v`
Expected: yamllint + ansible-lint pass, `check-tags: OK`, all tests PASS.
- [ ] Confirm a deliberate violation is caught: temporarily add `tags: [bogus]` to a
task in `playbooks/site.yml`, run `.venv/bin/python scripts/check-tags.py`, confirm it
exits 1 reporting `'bogus'`, then revert the edit.
- [ ] `git log --oneline -7` shows the six task commits.


@ -0,0 +1,544 @@
# Operational Access (ADR-021) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Establish operational access as a deployment deliverable — a documented, verifiable set of mesh-reachable troubleshooting paths for every host and service — by writing ADR-021, reconciling the latent ADR-016/020 SSH contradiction, adding the control-node SSH source to the host firewall, and wiring the `ACCESS.md` record + `/check-access` verifier into boma's governance.
**Architecture:** Source of truth is the committed design spec `docs/superpowers/specs/2026-06-09-operational-access-design.md`. Structured access facts live as declarative `access__*` data that renders `ACCESS.md` and drives `/check-access` (the access analogue of `VERIFY.md` + `/verify-service`). Work is split into **Tranche A — land now** (doctrine docs, the one firewall code change, the dormant `/check-access` command, governance wiring) and **Tranche B — build-pending on infra** (per-service `access__*` population, rendered `ACCESS.md` files, and `/check-access` *running*), which arrive with service roles and live hosts and require no action in this plan.
**Tech Stack:** Markdown ADRs/docs; Ansible role `base` (Jinja2 nftables template + `defaults/main.yml`); Molecule (Debian 13, render + `nft -c`, no apply) for the firewall test; Claude Code command file for `/check-access`.
---
## File structure
| File | Tranche | Responsibility |
|---|---|---|
| `docs/decisions/021-operational-access.md` | A | NEW — the doctrine (two layers, three-tier ladder, break-glass, `access__*` model, `/check-access`) |
| `docs/decisions/016-mesh-vpn.md` | A | MODIFY — reconcile: SSH on `wt0` **and** from `ubongo`'s LAN address |
| `docs/decisions/020-firewall.md` | A | MODIFY — guaranteed management plane gains the control-node SSH source |
| `docs/access/service-access-template.md` | A | NEW — the `ACCESS.md` record shape (rendered-from-data + prose tail) |
| `roles/base/defaults/main.yml` | A | MODIFY — add `base__firewall_control_addr` knob (default empty → no-op) |
| `roles/base/templates/nftables.conf.j2` | A | MODIFY — conditional management-plane SSH rule for the control address |
| `roles/base/molecule/default/converge.yml` | A | MODIFY — set the knob for the test |
| `roles/base/molecule/default/verify.yml` | A | MODIFY — assert the rendered rule |
| `.claude/commands/check-access.md` | A | NEW — the `/check-access` verifier command (dormant until infra exists) |
| `docs/security/service-checklist.md` | A | MODIFY — one new gate item |
| `docs/runbooks/new-role.md` | A | MODIFY — new step: write `ACCESS.md` (mirrors SECURITY/VERIFY steps) |
| `CLAUDE.md` | A | MODIFY — `ACCESS.md` in Role conventions; ADR-021 in Further reading |
| `STATUS.md` | A | MODIFY — new rows for the doctrine, the firewall source, `/check-access` |
| `docs/TODO.md` | A | MODIFY — mark 3.2 + 7.2 DECIDED → ADR-021 |
**Tranche B (no tasks here — captured for the record):** per-service `access__*` blocks + rendered `roles/<svc>/ACCESS.md` land when each service role is built (governed by the Tranche-A checklist + runbook); `/check-access` *running* lands when `ubongo` + staging + vault exist. Both are designed-now, build-pending — exactly like `/verify-service` under ADR-017.
---
## Tranche A — Land now
### Task 1: Write ADR-021
**Files:**
- Create: `docs/decisions/021-operational-access.md`
The ADR is the durable decision record derived from the committed spec
`docs/superpowers/specs/2026-06-09-operational-access-design.md`. Match the prose style and
heading shape of an existing ADR (read `docs/decisions/020-firewall.md` first). The ADR
**must** state these specifics — they are the parts easy to get wrong:
- **Doctrine sentence (verbatim):** *"Every host and every service guarantees at least one
documented, verifiable way in for operational troubleshooting — and the deploy that
creates it also records and proves it."*
- **Two layers:** host baseline (resolves TODO 7.2) + per-service record (resolves TODO 3.2).
- **Three-tier access ladder:** (1) `wt0` mesh SSH — primary, WireGuard-authenticated;
(2) LAN SSH from `ubongo` only — secondary, mesh-independent, source-IP-gated **plus**
keys-only + fail2ban; all other LAN hosts stay default-denied; (3) console — break-glass
per host class: cluster VMs → Proxmox serial/VNC console, `askari` → Hetzner
rescue/console, `ubongo` → local console; reachability-checked, never exercised.
- **Reconciliation, not weakening (state this explicitly):** ADR-016 already requires
Ansible to reach the fleet by LAN IP ("a mesh/coordinator outage never blocks on-LAN
runs"), which *requires* LAN SSH from `ubongo`; yet ADR-016 also said "SSH only on `wt0`"
and ADR-020's guaranteed management plane listed only `wt0`. ADR-021 resolves that latent
contradiction by making the control-node SSH allow explicit and adding it to the
guaranteed management plane. It does **not** weaken default-deny: exactly one extra
trusted source on the LAN.
- **Declarative `access__*` data model:** service-role defaults carry `access__service`,
`access__compose_project`, `access__compose_path`, `access__containers`,
`access__log.loki_labels`, and `access__api` (`enabled`, `base_url`, `firewall_ref`,
`auth.vault_ref`, `health_path`; or `enabled: false` + `reason`). **Invariant:**
`access__api` never opens a port — it `firewall_ref`s the `group_vars` firewall catalog;
ADR-020 stays the sole owner of exposure.
- **Rendered record:** `ACCESS.md` is rendered from that data + a prose tail (operational
notes / gotchas). First-class sibling of `SECURITY.md`/`VERIFY.md`.
- **`/check-access`:** the verifier that probes each declared path and reports which are
live; break-glass reachability-only; designed now, build-pending on infra.
- **Status / consequences:** what lands now vs build-pending (mirror this plan's split).
- [ ] **Step 1: Author the ADR**
Write `docs/decisions/021-operational-access.md` covering every bullet above, in the
house style of `docs/decisions/020-firewall.md` (problem → decision → layers/ladder →
data model → verifier → consequences). Open with a one-line title heading
`# ADR-021 — Operational access: documented, verifiable ways in`.
- [ ] **Step 2: Sanity-check internal links**
Run: `grep -n "ADR-01[67]\|ADR-020\|access__\|check-access\|ACCESS.md" docs/decisions/021-operational-access.md`
Expected: references to ADR-016, ADR-020, the `access__*` keys, `/check-access`, and
`ACCESS.md` all present.
- [ ] **Step 3: Commit**
```bash
git add docs/decisions/021-operational-access.md
git commit -m "docs(access): add ADR-021 operational-access doctrine"
```
---
### Task 2: Reconcile ADR-016 and ADR-020
**Files:**
- Modify: `docs/decisions/016-mesh-vpn.md` (the "Host firewall" bullet, ~line 64-65)
- Modify: `docs/decisions/020-firewall.md` (the "Guaranteed management plane" bullet, ~line 42-45)
- [ ] **Step 1: Amend ADR-016's Host-firewall bullet**
Replace the existing bullet:
```markdown
- **Host firewall:** NetBird's `wt0` interface; `base` nftables allows inbound SSH
**only on `wt0`** (the ADR-015 pattern, fleet-wide).
```
with:
```markdown
- **Host firewall:** `base` nftables allows inbound SSH on NetBird's `wt0` interface
(primary, WireGuard-authenticated) **and** from `ubongo`'s LAN address (secondary,
mesh-independent — required by the LAN-IP recovery path below, so a mesh/coordinator
outage never blocks on-LAN SSH). All other LAN hosts remain default-denied. This makes
explicit the control-node SSH allow that the recovery model already implied; the access
doctrine and the three-tier access ladder live in **ADR-021**.
```
- [ ] **Step 2: Amend ADR-020's guaranteed-management-plane bullet**
Replace:
```markdown
- **Guaranteed management plane**: loopback, established/related, and `wt0` (NetBird,
ADR-016) for SSH + Ansible are always allowed, independent of the catalog, applied
atomically — a malformed or empty catalog can never lock out management. (ADR-016: SSH
is allowed only on `wt0`.)
```
with:
```markdown
- **Guaranteed management plane**: loopback, established/related, `wt0` (NetBird,
ADR-016), and SSH from the control node's LAN address (`base__firewall_control_addr`,
the `ssh-from-control` source) for SSH + Ansible are always allowed, independent of the
catalog, applied atomically — a malformed or empty catalog can never lock out
management. The control-node source is part of the guaranteed plane, not the service
catalog (it is management, not a service); see ADR-021 for the access doctrine.
```
- [ ] **Step 3: Commit**
```bash
git add docs/decisions/016-mesh-vpn.md docs/decisions/020-firewall.md
git commit -m "docs(access): reconcile ADR-016/020 with control-node SSH source (ADR-021)"
```
---
### Task 3: The `ACCESS.md` record template
**Files:**
- Create: `docs/access/service-access-template.md`
Match the preamble convention of `docs/security/service-security-template.md` and
`docs/testing/service-verify-template.md` (a "copy this to `roles/<service>/ACCESS.md`"
preamble, then a `---`, then the record).
- [ ] **Step 1: Write the template**
Create `docs/access/service-access-template.md`:
```markdown
# Per-service operational-access record — template
Copy this file to `roles/<service>/ACCESS.md` when building a service role (ADR-021).
It is the per-service **operational-access record**: every documented, verifiable way in
for troubleshooting. The structured parts are **rendered from the role's `access__*`
data** (the single source of truth that also drives `/check-access`) — keep the data
authoritative and regenerate this file rather than hand-editing the tables. The prose
"Operational notes" tail is hand-written.
Delete this preamble in the copy and start from the heading below.
---
# Access — <service>
## Access paths
The mesh-reachable ways in, by tier (rendered from `access__*`):
| Tier | Path | Invocation |
|---|---|---|
| primary | `wt0` mesh SSH | `ssh <host>` (over the NetBird mesh) |
| secondary | LAN SSH from `ubongo` | `ssh <host>` (from the control node, LAN address) |
| — | container exec + compose | `docker compose -p <access__compose_project> -f <access__compose_path> ps` / `exec` |
| — | logs | Loki query for labels `<access__log.loki_labels>` (Grafana; ADR-018) |
| — | admin API | `curl -H 'Authorization: …(vault_ref)' <access__api.base_url><health_path>` — or `n/a` |
## Break-glass
Mesh-and-LAN-independent fallback for this host's class (recorded, not routine):
- <Proxmox serial/VNC console for cluster VMs · Hetzner rescue for `askari` · local console for `ubongo`>
## Operational notes
Prose the data can't capture — service quirks, "if X is wedged, do Y", ordering gotchas.
- <none yet>
```
- [ ] **Step 2: Commit**
```bash
git add docs/access/service-access-template.md
git commit -m "docs(access): add ACCESS.md service record template"
```
---
### Task 4: Add the control-node SSH source to the host firewall (TDD)
**Files:**
- Modify: `roles/base/defaults/main.yml`
- Modify: `roles/base/templates/nftables.conf.j2`
- Modify: `roles/base/molecule/default/converge.yml`
- Modify: `roles/base/molecule/default/verify.yml`
This is the only code in Tranche A. It adds an **optional** guaranteed-management-plane
allow for SSH from the control node's LAN address. Default empty ⇒ no rule rendered ⇒
no behaviour change until a real `ubongo` address is set in `group_vars` (build-pending).
Test path is the established one for this role: Molecule render + `nft -c` (no apply).
- [ ] **Step 1: Write the failing test — converge sets the knob, verify asserts the rule**
In `roles/base/molecule/default/converge.yml`, add the knob under `vars:` (alongside
`base__firewall_apply: false`):
```yaml
base__firewall_control_addr: 10.10.0.99 # test control-node LAN address
```
In `roles/base/molecule/default/verify.yml`, extend the "management plane" assert block's
`that:` list (the task asserting default-deny + `wt0` SSH) with:
```yaml
- "'ip saddr 10.10.0.99 tcp dport 22 accept' in nft"
```
- [ ] **Step 2: Run the test to verify it fails**
Run: `make test ROLE=base`
Expected: FAIL — the verify assert "input chain is missing default-deny or the management
plane" fires, because the template does not yet render the control-address rule.
- [ ] **Step 3: Add the default knob**
In `roles/base/defaults/main.yml`, after the `base__firewall_mgmt_interface` line, add:
```yaml
base__firewall_control_addr: ""  # control-node LAN address (ubongo); SSH allowed from it
                                 # as the guaranteed-management-plane `ssh-from-control`
                                 # source (ADR-021). Empty = no rule. Set in group_vars
                                 # once ubongo exists.
```
- [ ] **Step 4: Render the rule in the template**
In `roles/base/templates/nftables.conf.j2`, immediately after the `wt0` SSH line (the
`iifname "{{ base__firewall_mgmt_interface }}" ...` line), add:
```jinja
{% if base__firewall_control_addr %}
ip saddr {{ base__firewall_control_addr }} tcp dport {{ base__firewall_ssh_port }} accept
{% endif %}
```
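The conditional above can be mirrored in plain Python to show the two outcomes the Molecule test relies on. This is a stand-in for the Jinja2 logic, not part of the role; the `control_ssh_rule` name is hypothetical, and the default port of 22 is an assumption taken from the Step 1 assertion string.

```python
def control_ssh_rule(control_addr, ssh_port=22):
    """Mirror the template's conditional: an empty address renders no rule."""
    if not control_addr:
        return ""
    return f"ip saddr {control_addr} tcp dport {ssh_port} accept"


# Knob set (as in converge.yml) versus the empty default.
print(repr(control_ssh_rule("10.10.0.99")))
print(repr(control_ssh_rule("")))
```

With the knob empty the function returns an empty string, which corresponds to the "no behaviour change" default described at the top of this task.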
- [ ] **Step 5: Run the test to verify it passes**
Run: `make test ROLE=base`
Expected: PASS — the rule `ip saddr 10.10.0.99 tcp dport 22 accept` renders, `nft -c`
syntax-check succeeds, and all prior assertions (default-deny, `wt0` SSH, zone rules,
drop-in hook) still pass.
- [ ] **Step 6: Lint**
Run: `make lint`
Expected: PASS (no tag/FQCN/yaml regressions).
- [ ] **Step 7: Commit**
```bash
git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): add ssh-from-control management-plane source (ADR-021)"
```
---
### Task 5: Author the `/check-access` command (dormant until infra)
**Files:**
- Create: `.claude/commands/check-access.md`
Mirror the structure of `.claude/commands/verify-service.md` (a forward-looking command
with a hard Prerequisites gate). It does not run until `ubongo` + live/staging hosts +
vault exist; if a prerequisite is missing it must say so and stop.
- [ ] **Step 1: Write the command**
Create `.claude/commands/check-access.md`:
```markdown
Operational-access verification (ADR-021)
Probe every documented way in to a service or host from `ubongo` and report which paths
are live. Reads the target's `access__*` data (and host baseline), so the verifier and
`ACCESS.md` can never disagree. Argument: a service/role name or a host
(e.g. `/check-access photoprism`, `/check-access docker01`).
## Prerequisites (forward-looking — ADR-021 dependencies)
This skill cannot run until these exist; if any is missing, say so and stop — do not
improvise around it:
- `ubongo` reachable on the mesh **and** the LAN (it runs the probes).
- The target host/service is deployed (staging or production inventory).
- `roles/<name>/` carries `access__*` data (services) / the host baseline applies.
- Vault unlocked (`rbw unlocked`) for any token-authenticated API probe.
## Process
### Phase 0 — resolve the target
Resolve the argument to a host or a service role + its host. Load the `access__*` data
(service) or the host-baseline + break-glass record (host). State what you will probe.
### Phase 1 — probe each declared path
| Path | Probe | Green = |
|---|---|---|
| `wt0` mesh SSH | connect over the mesh, run `true` | reachable + key works |
| LAN SSH from `ubongo` | connect via the LAN address, run `true` | reachable + key works |
| exec + compose | `docker compose -p <project> ps`; exec `true` in each `access__containers` entry | stack up, exec works |
| logs | query Loki for `access__log.loki_labels`, expect recent lines | logs flowing |
| admin API | `curl` `access__api.health_path` with the token from `access__api.auth.vault_ref` | 2xx |
| break-glass | reachability of the Proxmox/provider console endpoint **only** | console host reachable |
Break-glass is **never exercised** — firing a serial console is invasive; confirm the
fallback exists, do not drive it.
### Phase 2 — report
Emit a pass/fail table. For any red path, name it and the likely cause (e.g. "API token
in vault stale", "Alloy not shipping", "`base__firewall_control_addr` unset → no
`ssh-from-control` rule"). Verdict line: e.g. "3/4 paths green; admin API red".
## Notes
- Read-only and non-destructive — probes confirm reachability, they do not change state.
- This is the access analogue of `/verify-service` (ADR-017): designed now, runs when the
control node + hosts exist.
```
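The Phase 2 report in the command above could be assembled along these lines. This is a sketch under assumptions: the command file only describes the report in prose, so the `access_report` function, the tuple shape, and the table columns are hypothetical.

```python
def access_report(results):
    """Build a pass/fail table and verdict line.

    results: list of (path_name, ok, likely_cause) tuples, in probe order.
    """
    lines = ["| Path | Result | Likely cause |", "|---|---|---|"]
    for name, ok, cause in results:
        colour = "green" if ok else "red"
        shown_cause = "" if ok else cause
        lines.append(f"| {name} | {colour} | {shown_cause} |")
    red = [name for name, ok, _ in results if not ok]
    green = len(results) - len(red)
    verdict = f"{green}/{len(results)} paths green"
    if red:
        verdict += "; " + ", ".join(red) + " red"
    return "\n".join(lines), verdict


table, verdict = access_report([
    ("wt0 mesh SSH", True, ""),
    ("LAN SSH from ubongo", True, ""),
    ("logs", True, ""),
    ("admin API", False, "API token in vault stale"),
])
print(table)
print(verdict)
```

Keeping the verdict as a single derived line means the summary cannot drift from the per-path rows it is computed from.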
- [ ] **Step 2: Commit**
```bash
git add .claude/commands/check-access.md
git commit -m "feat(access): add /check-access verifier command (ADR-021, dormant)"
```
---
### Task 6: Governance wiring — checklist + runbook
**Files:**
- Modify: `docs/security/service-checklist.md` (the "Operability (security-adjacent)" section)
- Modify: `docs/runbooks/new-role.md` (after step 10, the VERIFY.md step)
ACCESS.md mirrors how SECURITY.md/VERIFY.md are enforced: a manual runbook step + a
checklist gate (the scaffold does not auto-drop SECURITY/VERIFY today either, so ACCESS
follows the same manual-copy pattern — no Makefile change).
- [ ] **Step 1: Add the checklist gate item**
In `docs/security/service-checklist.md`, under `## Operability (security-adjacent)`, add a
bullet after the `/verify-service` item:
```markdown
- [ ] Operational access recorded and verifiable (ADR-021): the role carries `access__*`
data, `roles/<service>/ACCESS.md` is rendered, and `/check-access` reports the
documented paths green — or a deviation is recorded in
`docs/security/accepted-risks.md`
```
- [ ] **Step 2: Add the runbook step**
In `docs/runbooks/new-role.md`, insert a new step between step 10 (VERIFY.md) and the
final commit step, and renumber the commit step to 12:
```markdown
### 11. Write the per-service operational-access record (services)
For a **service** role, copy `docs/access/service-access-template.md` to
`roles/<rolename>/ACCESS.md` and populate the role's `access__*` data
(`access__service`, `access__compose_project`/`_path`, `access__containers`,
`access__log.loki_labels`, and `access__api`: `enabled` + endpoint + `firewall_ref` +
`auth.vault_ref` + `health_path`, or `enabled: false` with a reason). `ACCESS.md` is
rendered from that data; the admin-API path must `firewall_ref` an entry in the
`group_vars` firewall catalog, never open a port itself (ADR-020/021). Once hosts exist,
`/check-access <rolename>` proves the documented paths are live — part of the
service-clearance gate (`docs/security/service-checklist.md`).
```
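The `access__api` rules in that runbook step can be sketched as a small validator. The key names come from the plan; the `access_api_problems` function, its error strings, and the choice to treat every listed key as required when `enabled` is true are assumptions made for illustration.

```python
# Hypothetical validator; key names are from the plan, everything else is
# an assumption made for illustration.
REQUIRED_WHEN_ENABLED = ("base_url", "firewall_ref", "health_path")


def access_api_problems(api, catalog_refs):
    """Return a list of problems with an access__api mapping (empty = OK)."""
    problems = []
    if not api.get("enabled"):
        # A disabled admin API must still say why it is disabled.
        if not api.get("reason"):
            problems.append("enabled: false needs a reason")
        return problems
    for key in REQUIRED_WHEN_ENABLED:
        if not api.get(key):
            problems.append(f"missing {key}")
    if not api.get("auth", {}).get("vault_ref"):
        problems.append("missing auth.vault_ref")
    # Invariant: the API never opens a port itself, it points at the catalog.
    ref = api.get("firewall_ref")
    if ref and ref not in catalog_refs:
        problems.append(f"firewall_ref {ref!r} not in the firewall catalog")
    return problems
```

Called with a mapping whose `firewall_ref` is absent from the catalog, the function reports that one problem and nothing else, which is the invariant the step insists on: exposure stays owned by the firewall catalog.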
- [ ] **Step 3: Verify renumbering**
Run: `grep -n "^### 1[12]\." docs/runbooks/new-role.md`
Expected: `### 11. Write the per-service operational-access record` and `### 12. Commit`.
- [ ] **Step 4: Commit**
```bash
git add docs/security/service-checklist.md docs/runbooks/new-role.md
git commit -m "docs(access): gate ACCESS.md in checklist + new-role runbook (ADR-021)"
```
---
### Task 7: Index wiring — CLAUDE.md, STATUS.md, TODO.md
**Files:**
- Modify: `CLAUDE.md` (Role conventions list + Further reading table)
- Modify: `STATUS.md` (Designed-but-not-built table)
- Modify: `docs/TODO.md` (items 3.2 and 7.2)
- [ ] **Step 1: CLAUDE.md — Role conventions**
In the `## Role conventions` list, after the `VERIFY.md` bullet
("Every **service** role must have a populated `VERIFY.md` ..."), add:
```markdown
- Every **service** role must have a populated `ACCESS.md` (ADR-021) — copy
`docs/access/service-access-template.md`; rendered from the role's `access__*` data
```
- [ ] **Step 2: CLAUDE.md — Further reading**
In the Further reading table, after the Firewall strategy row, add:
```markdown
| Operational access | `docs/decisions/021-operational-access.md` |
```
- [ ] **Step 3: STATUS.md — new rows**
In the `## Designed but not built` table, add:
```markdown
| Operational-access doctrine (ADR-021) | ADR-021 | **Design RESOLVED** (ADR-021 + spec + plan). Two-layer doctrine, three-tier access ladder, `access__*` model, `ACCESS.md` record, `/check-access`. Reconciles ADR-016/020 SSH. |
| `ssh-from-control` firewall source | ADR-021 / ADR-020 | **Built (dormant).** `base__firewall_control_addr` knob + nftables rule + Molecule assertion landed; empty default = no rule until `ubongo`'s LAN address is set in `group_vars`. |
| `/check-access` verifier | ADR-021 | **Design RESOLVED** (`.claude/commands/check-access.md` authored). **Build pending:** running needs `ubongo` + live/staging hosts + vault. Access analogue of `/verify-service` (ADR-017). |
| Per-service `ACCESS.md` records | ADR-021 | Template + governance present; per-service files render when each service role is built. |
```
- [ ] **Step 4: docs/TODO.md — mark 3.2 and 7.2 DECIDED**
In `docs/TODO.md`, change item **3.2** from:
```markdown
2. Decide how to manage APIs / API access.
```
to:
```markdown
2. ~~Decide how to manage APIs / API access.~~ DECIDED (ADR-021): per-service `access__*`
data declares the admin API (endpoint + `firewall_ref` to the catalog + vault token
ref + health path); rendered into `ACCESS.md` and probed by `/check-access`. Part of
the two-layer operational-access doctrine.
```
And change item **7.2** from:
```markdown
2. Decide what to set up on the hosts, given that direct access will be rare.
```
to:
```markdown
2. ~~Decide what to set up on the hosts, given that direct access will be rare.~~
DECIDED (ADR-021): the host-layer access baseline — SSH on `wt0` + from `ubongo`,
Docker/Compose tooling, Alloy log shipping, and a recorded break-glass console per
host class.
```
- [ ] **Step 5: Verify and commit**
Run: `grep -n "021-operational-access\|ACCESS.md\|ssh-from-control" CLAUDE.md STATUS.md`
Expected: the new Role-conventions bullet, the Further-reading row, and the STATUS rows
are present.
```bash
git add CLAUDE.md STATUS.md docs/TODO.md
git commit -m "docs(access): wire ADR-021 into CLAUDE.md, STATUS, TODO"
```
---
## Tranche B — Build-pending on infra (no tasks now)
Recorded so the boundary is explicit; nothing here is actioned by this plan.
- **Per-service `access__*` + rendered `ACCESS.md`** — authored when each service role is
built, governed by the Task 6 checklist item + runbook step. The first real service role
is where this first runs.
- **`/check-access` running** — needs `ubongo` + a live/staging host + vault. The command
(Task 5) already gates on these and stops cleanly until then.
- **Real `base__firewall_control_addr` value** — set in `group_vars/all` to `ubongo`'s LAN
address once `ubongo` is in inventory; the machinery + test landed in Task 4.
---
## Self-review
**Spec coverage:** doctrine + two layers → Task 1; three-tier ladder + ADR-016/020
reconciliation → Tasks 1-2, 4; `access__*` model + invariant → Tasks 1, 3, 6; rendered
`ACCESS.md` → Task 3; `/check-access` → Task 5; governance (checklist/runbook) → Task 6;
repo wiring (CLAUDE/STATUS/TODO) → Task 7; build-now vs build-pending split → Tranches
A/B. All spec sections map to a task.
**Deviations from the spec (deliberate, flagged for the user):**
1. The spec called `ssh-from-control` a *catalog* source; the plan places it in the
*guaranteed management plane* (`base__firewall_control_addr`) instead — ADR-020 already
houses SSH/Ansible management allows there, independent of the catalog, and the spec's
own invariant says the catalog owns *service* exposure only. Same intent, correct home.
2. The spec said `make new-role` would *scaffold* an `ACCESS.md` stub; the plan instead adds
a manual runbook step (Task 6) mirroring how `SECURITY.md`/`VERIFY.md` are handled today
(also manual copies, not scaffolded). Avoids unilaterally restructuring the scaffold;
the "can't be forgotten" intent is met by the checklist gate + runbook step.
**Type/name consistency:** `base__firewall_control_addr` (knob), `access__service` /
`access__compose_project` / `access__compose_path` / `access__containers` /
`access__log.loki_labels` / `access__api.{enabled,base_url,firewall_ref,auth.vault_ref,health_path}`
are used identically across Tasks 1, 3, 5, 6. The rendered nftables rule string
`ip saddr <addr> tcp dport 22 accept` matches between Task 4's template (Step 4) and its
assertion (Step 1).


@ -0,0 +1,556 @@
# ADR Structure & Lifecycle Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Codify how boma's ADRs are structured: a canonical section set, a Proposed/Accepted/Superseded/Deprecated lifecycle, a template, a lightweight enforcement check, and a one-time Status backfill of the back-catalogue.
**Architecture:** Five independent units. (1) A pure-function `adr-structure` check added to the existing `scripts/repo-scan.py` (stdlib only, pytest-tested like its siblings), verifying every numbered ADR has the four mandatory sections and a parseable Status line (presence only, not order). (2) An `adr-template.md` scaffold. (3) ADR-023 itself, written to pass its own check. (4) Wiring into CLAUDE.md and the `/review-repo` command doc. (5) A mechanical backfill adding `## Status` to ADRs 001-018, dated from each file's first git commit.
**Tech Stack:** Python 3 stdlib (`scripts/repo-scan.py`), pytest (`.venv/bin/pytest`), Markdown, git.
**Spec:** `docs/superpowers/specs/2026-06-10-adr-structure-design.md`
**Branch:** `feat/adr-structure` (already created; the design spec is the first commit).
**Convention reminders (from CLAUDE.md):** docs-/script-only commits skip the ansible-lint pre-commit hook and need no `rbw` unlock. Imperative subject ≤72 chars. `Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>` trailer on every commit.
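The backfill in the Architecture paragraph dates each Status from the file's first git commit. A minimal sketch of how that date could be looked up follows; the helper names are hypothetical, and using the author date (rather than the committer date) is an assumption, since the plan does not say which one it means.

```python
import subprocess


def oldest_date(log_output):
    """git log lists newest first, so the last non-empty line is the oldest."""
    dates = [ln.strip() for ln in log_output.splitlines() if ln.strip()]
    return dates[-1] if dates else None


def first_commit_date(path, repo="."):
    """Return the YYYY-MM-DD author date of the commit that added `path`."""
    out = subprocess.run(
        ["git", "log", "--follow", "--diff-filter=A",
         "--format=%ad", "--date=short", "--", path],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    return oldest_date(out)


def status_line(date):
    """Format the lifecycle line used for back-catalogue ADRs."""
    return f"Accepted ({date})"
```

For example, `status_line(first_commit_date("docs/decisions/001-architecture.md"))` would produce the line to place under `## Status`, provided it is run inside the repository.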
---
## Decisions locked by the spec (do not re-litigate)
- **Mandatory sections, in this order:** `## Status`, `## Context`, `## Decision`, `## Consequences`.
- **Optional sections:** `## Related`, `## Scope`, `## Guardrails` / `## Enforcement`, `## What was ruled out`, `## Verified facts (ADR-014)`.
- **Status lifecycle (4 states):** `Proposed (YYYY-MM-DD)` (genuine drafts, e.g. ADR-011) → `Accepted (YYYY-MM-DD)` (the common starting state) → optionally `Superseded by ADR-NNN (YYYY-MM-DD)` or `Deprecated (YYYY-MM-DD)`. (`Proposed` was added on the evidence of ADR-011, which is a real draft with open questions.)
- **No silent rewrites:** material reversal = new ADR + `Superseded by` marker; bidirectional link.
- **Enforcement checks presence + parseable Status line, NOT section order.** Order is demonstrated by the template, not machine-enforced.
- **Back-catalogue is fully restructured (no grandfathering):** ADRs 001-018 are brought to all-four-section conformance. The restructure is **presentational**: relabel/regroup/demote existing headings, add a dated Status, assemble a Consequences section from implications the ADR already states. **The substance of no decision is changed.** If a faithful Consequences cannot be drawn from existing content, escalate that file rather than inventing one.
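The four-state lifecycle listed above can be sketched as a small parser. This is a hypothetical illustration only: `STATUS_RE` and `parse_status` are not part of the plan, and the regex that the enforcement check actually uses is the one defined in Task 1.

```python
import re

# Hypothetical four-state parser for illustration; the regex used by the
# enforcement check is defined in Task 1 and is authoritative for this plan.
STATUS_RE = re.compile(
    r"^(?P<state>Proposed|Accepted|Deprecated|Superseded by ADR-(?P<by>\d{3}))"
    r" \((?P<date>\d{4}-\d{2}-\d{2})\)"
)


def parse_status(line):
    """Return (state, date, superseded_by), or None for a non-lifecycle line."""
    m = STATUS_RE.match(line.strip())
    if not m:
        return None
    state = "Superseded" if m.group("by") else m.group("state")
    return state, m.group("date"), m.group("by")


print(parse_status("Accepted (2026-06-10)"))
print(parse_status("Superseded by ADR-100 (2026-06-11)"))
print(parse_status("Designed, not built."))
```

Because the pattern is anchored only at the start, a trailing note after the date still parses, which matches the "optional trailing note" allowance in the template.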
---
## Task 1: `adr-structure` check in repo-scan.py
**Files:**
- Modify: `scripts/repo-scan.py` (add module-level regexes near the other `_RE` definitions ~line 38-44; add `adr_structure_findings()` next to `deferred_findings()` ~line 96; wire it into `scan()` at the `findings.extend(...)` site ~line 215)
- Test: `tests/test_repo_scan.py` (new)
- [ ] **Step 1: Write the failing test**
Create `tests/test_repo_scan.py`:
```python
import importlib.util
import pathlib

_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "repo-scan.py"
_spec = importlib.util.spec_from_file_location("repo_scan", _PATH)
rs = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(rs)

GOOD = [
    "# ADR-099 — Example\n", "\n",
    "## Status\n", "\n", "Accepted (2026-06-10)\n", "\n",
    "## Context\n", "\n", "Why.\n", "\n",
    "## Decision\n", "\n", "What.\n", "\n",
    "## Consequences\n", "\n", "So what.\n",
]


def _checks(findings):
    return [f for f in findings if f["check"] == "adr-structure"]


def test_good_adr_has_no_findings():
    out = rs.adr_structure_findings({"docs/decisions/099-example.md": GOOD})
    assert _checks(out) == []


def test_missing_mandatory_section_is_flagged():
    lines = [ln for ln in GOOD if not ln.startswith("## Consequences")]
    out = _checks(rs.adr_structure_findings({"docs/decisions/099-example.md": lines}))
    assert len(out) == 1
    assert "Consequences" in out[0]["detail"]


def test_unparseable_status_is_flagged():
    lines = [("Designed, not built.\n" if ln == "Accepted (2026-06-10)\n" else ln)
             for ln in GOOD]
    out = _checks(rs.adr_structure_findings({"docs/decisions/099-example.md": lines}))
    assert len(out) == 1
    assert "Status not parseable" in out[0]["detail"]


def test_superseded_status_is_accepted():
    lines = [("Superseded by ADR-100 (2026-06-11)\n" if ln == "Accepted (2026-06-10)\n"
              else ln) for ln in GOOD]
    out = _checks(rs.adr_structure_findings({"docs/decisions/099-example.md": lines}))
    assert out == []


def test_non_numbered_file_is_skipped():
    bare = ["# ADR template\n", "\n", "## Status\n", "\n", "<!-- hint -->\n"]
    out = _checks(rs.adr_structure_findings({"docs/decisions/adr-template.md": bare}))
    assert out == []
```
- [ ] **Step 2: Run the test to verify it fails**
Run: `.venv/bin/pytest tests/test_repo_scan.py -q`
Expected: FAIL — `AttributeError: module 'repo_scan' has no attribute 'adr_structure_findings'`.
- [ ] **Step 3: Add the regexes**
In `scripts/repo-scan.py`, after the `RESOLVE_WORD_RE = ...` line (~line 44), add:
```python
# ADR-structure check (ADR-023): numbered ADRs must carry the four mandatory
# sections and a parseable Status line. Presence only — section ORDER is a
# template-demonstrated convention, not machine-enforced.
ADR_FILE_RE = re.compile(r"^\d{3}-.*\.md$")
ADR_REQUIRED_SECTIONS = ("Status", "Context", "Decision", "Consequences")
ADR_STATUS_LINE_RE = re.compile(
    r"^(Accepted \(\d{4}-\d{2}-\d{2}\)"
    r"|Superseded by ADR-\d{3}"
    r"|Deprecated \(\d{4}-\d{2}-\d{2}\))")
```
- [ ] **Step 4: Add the check function**
In `scripts/repo-scan.py`, immediately after the `deferred_findings(...)` function (it ends ~line 96, just before `def walk_files():`), add:
```python
def adr_structure_findings(adr_files):
    """adr_files: {rel_path: [lines]} for docs/decisions/*.md.
    Flags numbered ADRs (NNN-*.md) missing a mandatory section or whose Status
    section has no parseable lifecycle line. Non-numbered files (e.g.
    adr-template.md) are skipped. Section order is NOT checked (ADR-023)."""
    out = []
    for rpath, lines in sorted(adr_files.items()):
        if not ADR_FILE_RE.match(os.path.basename(rpath)):
            continue
        headings = {}
        for i, line in enumerate(lines):
            m = re.match(r"^##\s+(\w+)", line)
            if m:
                headings.setdefault(m.group(1), i)
        missing = [s for s in ADR_REQUIRED_SECTIONS if s not in headings]
        if missing:
            out.append({"check": "adr-structure", "severity": "medium",
                        "path": rpath, "line": 1,
                        "detail": f"missing mandatory section(s): {', '.join(missing)}"})
        if "Status" in headings:
            body = []
            for line in lines[headings["Status"] + 1:]:
                if line.startswith("## "):
                    break
                body.append(line)
            status_text = next((ln.strip() for ln in body if ln.strip()), "")
            if not ADR_STATUS_LINE_RE.match(status_text):
                out.append({"check": "adr-structure", "severity": "medium",
                            "path": rpath, "line": headings["Status"] + 1,
                            "detail": "Status not parseable (want 'Accepted (YYYY-MM-DD)', "
                                      "'Superseded by ADR-NNN', or 'Deprecated (YYYY-MM-DD)'); "
                                      f"got: {status_text[:60]!r}"})
    return out
```
- [ ] **Step 5: Run the test to verify it passes**
Run: `.venv/bin/pytest tests/test_repo_scan.py -q`
Expected: PASS — 5 passed.
- [ ] **Step 6: Wire the check into `scan()`**
In `scripts/repo-scan.py`, find (~line 215):
```python
    findings.extend(deferred_findings(adr_files, defer_refs))
    return findings
```
Replace with:
```python
    findings.extend(deferred_findings(adr_files, defer_refs))
    findings.extend(adr_structure_findings(adr_files))
    return findings
```
- [ ] **Step 7: Confirm the check fires on the real (not-yet-backfilled) repo**
Run: `python3 scripts/repo-scan.py 2>/dev/null | python3 -c "import json,sys; print(sorted({f['path'] for f in json.load(sys.stdin)['findings'] if f['check']=='adr-structure'}))"`
Expected: a list including `docs/decisions/001-architecture.md` … through `018-logging.md` (001-015 missing Status; 016-018 unparseable Status). 019-022 and 023 must NOT appear. This proves the check works and previews Task 5's worklist.
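The one-liner in this step is dense, so here is the same filter written out as a function. It assumes only what the one-liner itself shows: the scan emits JSON with a top-level `findings` list whose entries carry `check` and `path`. The `adr_structure_paths` name is hypothetical.

```python
import json


def adr_structure_paths(report_text):
    """Return the sorted, de-duplicated paths flagged by the adr-structure check."""
    report = json.loads(report_text)
    return sorted({f["path"] for f in report["findings"]
                   if f["check"] == "adr-structure"})
```

Feeding it the scanner's stdout gives the same list as the shell pipeline, in a form that is easier to reuse when building the Task 5 worklist.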
- [ ] **Step 8: Commit**
```bash
git add scripts/repo-scan.py tests/test_repo_scan.py
git commit -m "feat(review): add adr-structure check to repo-scan

Flags numbered ADRs missing a mandatory section (Status/Context/Decision/
Consequences) or with an unparseable Status line. Presence only, not order.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
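The Step 7 one-liner packs a JSON filter into a shell pipeline. The same filter can be sketched as a small helper, assuming the report shape the one-liner implies (a top-level `findings` list whose entries carry `check` and `path`). The name `adr_structure_paths` is hypothetical and is not part of `scripts/repo-scan.py`.

```python
import json


def adr_structure_paths(report_json: str) -> list[str]:
    """Return the sorted, de-duplicated paths that carry an adr-structure finding."""
    report = json.loads(report_json)
    return sorted({f["path"] for f in report["findings"]
                   if f["check"] == "adr-structure"})


if __name__ == "__main__":
    # Hypothetical sample shaped like the findings the one-liner reads;
    # the "other-check" name is a placeholder, not a real check key.
    sample = json.dumps({"findings": [
        {"check": "adr-structure", "path": "docs/decisions/003-toolchain.md"},
        {"check": "other-check", "path": "docs/decisions/019-example.md"},
        {"check": "adr-structure", "path": "docs/decisions/001-architecture.md"},
    ]})
    print(adr_structure_paths(sample))
```

Building a set before sorting mirrors the one-liner: a file with both a missing-section and an unparseable-Status finding is listed once.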
---
## Task 2: ADR template
**Files:**
- Create: `docs/decisions/adr-template.md`
- [ ] **Step 1: Write the template**
Create `docs/decisions/adr-template.md` with exactly:
```markdown
# ADR-NNN — <Title>: <optional clarifying subtitle>
<!-- Filename: NNN-kebab-title.md (zero-padded, monotonic, never reused).
Register a row in CLAUDE.md "Further reading" when this ADR is created.
Sections below in order. Mandatory: Status, Context, Decision, Consequences.
Delete this comment and any optional section you don't use. -->
## Status
Accepted (YYYY-MM-DD)
<!-- Lifecycle: "Accepted (YYYY-MM-DD)" → later "Superseded by ADR-NNN (YYYY-MM-DD)"
or "Deprecated (YYYY-MM-DD)" + one-line why. Optional trailing note OK, e.g.
"Accepted (2026-06-10). Doctrine ADR — pins policy, builds nothing yet." -->
## Context
<!-- The forces, the problem, what exists today, why now. -->
## Decision
<!-- What we are doing. Use numbered sub-decisions (### 1. ...) for multi-part ADRs. -->
## Consequences
<!-- Results, trade-offs explicitly accepted, follow-on work. -->
<!-- Optional sections — uncomment any that genuinely apply; never pad:
## Scope — explicit in / out-of-scope boundaries.
## Guardrails — how the decision is mechanically enforced (lint, CI, hooks).
## What was ruled out — rejected alternatives, each with its reason.
## Verified facts (ADR-014) — verified: <subject> · <tool> <version> · <source> · <YYYY-MM-DD>
## Related — links to other ADRs by number; bidirectional for Supersedes/Superseded-by.
-->
```
(HTML comments do not nest — optional sections use one flat comment block with inline
em-dash descriptions, not commented sub-hints inside an outer comment.)
- [ ] **Step 2: Confirm the template is skipped by the check**
Run: `python3 scripts/repo-scan.py 2>/dev/null | python3 -c "import json,sys; print([f for f in json.load(sys.stdin)['findings'] if f['check']=='adr-structure' and 'adr-template' in f['path']])"`
Expected: `[]` (non-numbered filename → skipped).
- [ ] **Step 3: Commit**
```bash
git add docs/decisions/adr-template.md
git commit -m "docs(adr): add adr-template.md scaffold (ADR-023)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
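The template's filename rule (`NNN-kebab-title.md`, zero-padded, monotonic, never reused) can be sketched as a helper that derives the next filename from the existing directory listing. This is a minimal illustration, not part of the plan: `next_adr_filename` is a hypothetical name, and taking "highest existing number plus one" relies on superseded ADRs keeping their files, as the template comment states.

```python
import re


def next_adr_filename(existing: list[str], title: str) -> str:
    """Build 'NNN-kebab-title.md' using the next unused ADR number."""
    # Only numbered ADR files count; adr-template.md and similar are ignored.
    numbers = [int(m.group(1)) for name in existing
               if (m := re.match(r"^(\d{3})-.+\.md$", name))]
    next_number = max(numbers, default=0) + 1
    # Kebab-case: lowercase, runs of non-alphanumerics collapse to one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{next_number:03d}-{slug}.md"
```

For example, `next_adr_filename(["001-architecture.md", "adr-template.md"], "Mesh VPN")` yields `002-mesh-vpn.md`; the title here is only a placeholder.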
---
## Task 3: ADR-023 itself
**Files:**
- Create: `docs/decisions/023-adr-structure.md`
- [ ] **Step 1: Write ADR-023**
Create `docs/decisions/023-adr-structure.md`. It must pass its own check (Status/Context/Decision/Consequences present; parseable Status line). Use this content:
```markdown
# ADR-023 — ADR structure & lifecycle
## Status
Accepted (2026-06-10). Meta/doctrine ADR — pins how ADRs are written; the
`adr-structure` check (`scripts/repo-scan.py`) and `docs/decisions/adr-template.md`
ship with it, and ADRs 001–018 were retroactively restructured to conform. Resolves
the FRICTION signal (2026-05-31) about ADR-writing policy being unsettled.
## Context
boma records architectural decisions as numbered ADRs in `docs/decisions/`, and
CLAUDE.md treats them as load-bearing. Yet no ADR said how an ADR is written. The
newest ADRs (019–022) converged on a clean shape (Status → Context → Decision →
Consequences → Related), but only by imitation. ADRs 001–018 predate it and drifted
widely: most lacked a `## Status` section entirely (016–018 carried only a trailing
build-state note), and many lacked an explicit `## Decision` or `## Consequences`
heading, their decisions spread across ad-hoc topical sections. The result was
structural drift and no uniform way to tell an active decision from a superseded or
deprecated one.
## Decision
### 1. Title & filename
Title line: `# ADR-NNN — <Title>: <optional clarifying subtitle>` (em-dash). Filename:
`NNN-kebab-title.md`, zero-padded 3-digit, monotonic, never reused — a superseded ADR
keeps its number and file. A new ADR is registered as a row in the CLAUDE.md
"Further reading" table.
### 2. Mandatory sections, in this order
- `## Status` — a lifecycle line, usually `Accepted (YYYY-MM-DD)` (see §4), plus an
optional one-line note.
- `## Context` — the forces, the problem, what exists today, why now.
- `## Decision` — what we are doing; numbered sub-decisions for multi-part ADRs.
- `## Consequences` — results, trade-offs explicitly accepted, follow-on work.
### 3. Optional sections (use only where they genuinely apply)
`## Related`, `## Scope`, `## Guardrails` / `## Enforcement`, `## What was ruled out`,
`## Verified facts (ADR-014)`.
### 4. Status lifecycle
Four states. Because boma is single-contributor and trunk-based with no review gate,
most ADRs are **born `Accepted (YYYY-MM-DD)`** — committed-to on writing. A
**`Proposed`** state exists for a genuine draft whose core direction is recorded but
whose specifics are still open for discussion (e.g. ADR-011); it is promoted to
`Accepted` once settled.
- **`Proposed (YYYY-MM-DD)`** — drafted, under discussion, not yet committed-to. May
carry open questions. Promoted to `Accepted (YYYY-MM-DD)` when decided.
- **`Accepted (YYYY-MM-DD)`** — committed-to. The common starting state.
- Replaced → old ADR's Status becomes **`Superseded by ADR-NNN (YYYY-MM-DD)`**; the new
ADR records `Supersedes ADR-MMM` in its Status and `## Related`. The link is
**bidirectional**.
- Retired with no replacement → **`Deprecated (YYYY-MM-DD)`** + a one-line reason.
**No silent rewrites.** An Accepted ADR is not edited to reverse its decision. Typo and
clarity fixes are fine; a material reversal requires a new ADR and a `Superseded by`
marker on the old one.
### 5. Template & enforcement
`docs/decisions/adr-template.md` is the scaffold for new ADRs. The `/review-repo`
command's pre-scan (`scripts/repo-scan.py`) emits an `adr-structure` finding for any
numbered ADR missing a mandatory section or with an unparseable Status line. It checks
**presence and Status, not section order** — order is a convention the template carries,
deliberately not gated, to keep enforcement lightweight (consistent with boma's other
doctrine ADRs adding no CI gate).
### 6. Retroactive conformance of the back-catalogue
ADRs 001–018 are restructured to satisfy this standard rather than grandfathered. The
restructure is **presentational** — existing headings are relabelled, regrouped, or
demoted under a `## Decision` umbrella; a dated `## Status` is added; a `## Consequences`
section is assembled from implications the ADR already states. **The substance of no
decision is changed.** This keeps the check uniform (no number threshold) and the corpus
a consistent, legible decision history.
## Consequences
- New ADRs have one obvious shape and a scaffold; structural drift stops.
- Every ADR declares its lifecycle state uniformly, and reversals are traceable.
- The whole corpus conforms; the check needs no grandfathering and stays simple.
- One-time restructure churn across ADRs 001–018 (heading reorganization + a Status and
a Consequences section per file; no decision substance changed).
- `/review-repo` grows one deterministic check; no new CI machinery.
- This ADR is the first conformant example and is held to its own check.
## What was ruled out
- **A `make lint` / CI gate for ADR structure** — heavier than the risk warrants;
the `/review-repo` check and the template suffice.
- **Machine-enforcing section order** — brittle for marginal value; left as a
template-demonstrated convention.
- **Grandfathering 001–018 from the check** — rejected in favour of restructuring the
whole corpus to conform, so the standard applies uniformly with no exceptions.
## Related
- ADR-014 — knowledge sourcing (the `Verified facts` optional section).
- ADR-019/020/021/022 — the emergent structure this ADR codifies.
- `docs/decisions/adr-template.md` — the scaffold.
- `scripts/repo-scan.py` — the `adr-structure` enforcement check.
```
- [ ] **Step 2: Confirm ADR-023 passes its own check**
Run: `python3 scripts/repo-scan.py 2>/dev/null | python3 -c "import json,sys; print([f for f in json.load(sys.stdin)['findings'] if f['check']=='adr-structure' and '023-' in f['path']])"`
Expected: `[]`.
- [ ] **Step 3: Commit**
```bash
git add docs/decisions/023-adr-structure.md
git commit -m "docs(adr): ADR-023 — ADR structure & lifecycle

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
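The Status lifecycle in ADR-023 §4 can be sketched as a small parser. This is an illustration under stated assumptions: the pattern below is hypothetical and may differ from the real `ADR_STATUS_LINE_RE` in `scripts/repo-scan.py`. It accepts the four states named in §4, tolerates a trailing note, and treats the date on a `Superseded by` line as optional because the plan shows that form both with and without one.

```python
import re

# Hypothetical pattern; the real ADR_STATUS_LINE_RE in scripts/repo-scan.py may differ.
STATUS_SKETCH_RE = re.compile(
    r"^(?:"
    r"(?P<state>Proposed|Accepted|Deprecated) \((?P<date>\d{4}-\d{2}-\d{2})\)"
    r"|Superseded by ADR-(?P<by>\d{3})(?: \((?P<sdate>\d{4}-\d{2}-\d{2})\))?"
    r")"
)


def parse_status(line: str):
    """Return (state, date, superseded_by), or None if the line is not parseable."""
    m = STATUS_SKETCH_RE.match(line.strip())
    if not m:
        return None
    if m.group("by"):
        return ("Superseded", m.group("sdate"), f"ADR-{m.group('by')}")
    return (m.group("state"), m.group("date"), None)
```

Because the pattern is anchored only at the start, a line such as `Accepted (2026-06-10). Meta/doctrine ADR` parses while a bare build-state note does not.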
---
## Task 4: Wire into CLAUDE.md and the review-repo command doc
**Files:**
- Modify: `CLAUDE.md` ("Further reading" table)
- Modify: `.claude/commands/review-repo.md` (the deterministic-findings description, ~lines 26–28)
- [ ] **Step 1: Add the CLAUDE.md "Further reading" row**
In `CLAUDE.md`, in the "Further reading" table, after the `Backup & disaster recovery` row, add:
```markdown
| ADR structure & lifecycle | `docs/decisions/023-adr-structure.md` |
```
- [ ] **Step 2: Mention the new check in review-repo.md**
In `.claude/commands/review-repo.md`, find (~lines 27–28):
```markdown
(roles, ADRs, runbooks, playbooks, scripts — your shard list) and **exact findings**
(markers, broken refs, unencrypted vaults). Fold these into the report verbatim.
```
Replace the parenthetical with:
```markdown
(roles, ADRs, runbooks, playbooks, scripts — your shard list) and **exact findings**
(markers, broken refs, unencrypted vaults, ADR-structure violations). Fold these into
the report verbatim.
```
- [ ] **Step 3: Verify the CLAUDE.md link resolves**
Run: `test -f docs/decisions/023-adr-structure.md && echo OK`
Expected: `OK`.
- [ ] **Step 4: Commit**
```bash
git add CLAUDE.md .claude/commands/review-repo.md
git commit -m "docs(adr): register ADR-023 and note adr-structure check

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
## Task 5: Retroactively restructure ADRs 001–018 to full conformance
**Goal:** every ADR in 001–018 ends with all four mandatory sections present and a
parseable Status line, so the `adr-structure` check reports zero findings — **without
changing the substance of any decision.**
**Files (current findings — the exact worklist):**
- Missing `Status` + `Consequences`: `001-architecture.md`, `002-security.md`, `004-docker-model.md`, `005-bootstrapping.md`, `014-knowledge-sourcing.md`
- Missing `Status` + `Decision` + `Consequences`: `006-terraform.md`, `007-network.md`, `008-testing.md`, `009-provisioning-handoff.md`, `010-forgejo-ci.md`, `011-update-management.md`
- Missing all four: `003-toolchain.md`
- Missing `Status` + `Decision`: `013-heritage-v4.md`
- Missing `Status` only: `012-hardware-capacity.md`, `015-control-host.md`
- Have unparseable `Status` + missing `Consequences`: `016-mesh-vpn.md`, `017-service-ui-verification.md`, `018-logging.md`
(`010`/`011` use `## Decisions` (plural) → relabel to `## Decision`. The "missing
Decision" cases generally have the decision spread across topical `##` headings.)
**THE FAITHFULNESS RULE (non-negotiable):** This is a *presentational* restructure.
You MAY: add a `## Status` section; relabel a heading (`## Decisions` → `## Decision`);
introduce a `## Decision` umbrella heading and **demote** existing topical `##` headings
to `###` beneath it; add a `## Consequences` section. You MUST NOT alter any existing
sentence of decision prose, reword arguments, or add new policy. A `## Consequences`
section is assembled **only** from implications the ADR already states (its trade-offs,
"what was ruled out", "open questions", named follow-on work). **If an ADR states
nothing that can be faithfully cast as a consequence, STOP and report it as
DONE_WITH_CONCERNS / escalate — do not invent consequences.**
**Per-file date source:** the file's first git-commit (add) date —
`git log --diff-filter=A --format=%as -- <path> | tail -1` (yields `YYYY-MM-DD`).
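The per-file date lookup can be scripted when working through all eighteen files. The sketch below wraps the same `git log` invocation and splits the parsing into a pure helper so it can be checked without a repository. Both function names are hypothetical.

```python
import subprocess


def oldest_add_date(git_log_output: str) -> str:
    """Pick the last non-empty line: git log prints newest first, so this is the original add."""
    dates = [ln.strip() for ln in git_log_output.splitlines() if ln.strip()]
    if not dates:
        raise ValueError("no add commit found for this path")
    return dates[-1]


def first_add_date(path: str) -> str:
    """Run the git command quoted above and return the YYYY-MM-DD add date."""
    result = subprocess.run(
        ["git", "log", "--diff-filter=A", "--format=%as", "--", path],
        capture_output=True, text=True, check=True,
    )
    return oldest_add_date(result.stdout)
```

Taking the last line plays the role of `tail -1`: if a file was ever deleted and re-added, the earliest add wins.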
- [ ] **Step 1: Add a dated `## Status` section to each ADR**
For 001–015 (no Status today): insert, between the title line and the first `##`
heading, a Status section:
```markdown
## Status
Accepted (<d>)
```
where `<d>` is the file's first-git-commit date. For 016/017/018 (unparseable Status
today): prepend a parseable `Accepted (<d>). ` clause to the first line of their
existing `## Status` section so the build-state note becomes its tail, e.g.
`Accepted (2026-06-05). Designed. **Authorable now:** ...`.
- [ ] **Step 2: Ensure a `## Decision` section exists**
For ADRs flagged "missing Decision" (003, 006, 007, 008, 009, 010, 011, 013): relabel a
plural/synonym heading where one exists (`## Decisions` → `## Decision` in 010/011), or
introduce a `## Decision` umbrella immediately after `## Context` and demote the existing
topical `##` body headings (e.g. in 003: "Execution engine", "Python environment", …) to
`###`. Do not move or rewrite the prose under them.
- [ ] **Step 3: Ensure a `## Consequences` section exists**
For every ADR flagged "missing Consequences" (001, 002, 003, 004, 005, 006, 007, 008,
009, 010, 011, 014, 016, 017, 018): add a `## Consequences` section near the end,
assembled strictly from implications the ADR already states. Where an ADR has a trailing
section that *is* consequences under another name (e.g. "What was ruled out", "Open
questions", "Trade-offs"), you may keep that section and add a short `## Consequences`
that references/summarizes the already-stated trade-offs — without introducing new
claims. **Honour the faithfulness rule; escalate any ADR where no faithful Consequences
can be drawn.**
- [ ] **Step 4: Verify the whole corpus passes the check**
Run: `python3 scripts/repo-scan.py 2>/dev/null | python3 -c "import json,sys; v=[f for f in json.load(sys.stdin)['findings'] if f['check']=='adr-structure']; print('adr-structure findings:', len(v)); [print(' ', f['path'], '—', f['detail']) for f in v]"`
Expected: `adr-structure findings: 0`.
- [ ] **Step 5: Verify faithfulness via diff**
Run: `git diff --stat` and spot-check `git diff docs/decisions/003-toolchain.md`.
Expected: changes are heading additions/relabels/level-demotions, a new Status section,
and a new Consequences section — **no edits to existing decision sentences.**
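The spot-check above can be backed by a mechanical test of the faithfulness rule. As a minimal sketch: every non-heading, non-blank line of the old file must reappear unchanged and in the same order in the new file. `is_presentational_only` is a hypothetical helper, not something the plan ships. It is deliberately coarse, since it treats any line starting with `#` as a heading, including comment lines inside code fences.

```python
def is_presentational_only(before: list[str], after: list[str]) -> bool:
    """True if every prose line of `before` survives in `after`, in the same order."""
    def prose(lines):
        return [ln.rstrip() for ln in lines
                if ln.strip() and not ln.lstrip().startswith("#")]

    remaining = iter(prose(after))
    # `x in iterator` consumes the iterator up to the match, which makes this
    # an ordered-subsequence test rather than a plain membership test.
    return all(line in remaining for line in prose(before))
```

Added Status and Consequences lines pass because only the old prose is required to survive. A reworded or reordered sentence fails.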
- [ ] **Step 6: Run the repo-scan test suite**
Run: `.venv/bin/pytest tests/test_repo_scan.py -q`
Expected: PASS — 5 passed.
- [ ] **Step 7: Commit**
```bash
git add docs/decisions/0*.md docs/decisions/1*.md
git commit -m "docs(adr): restructure ADRs 001-018 to ADR-023 conformance

Presentational only: add a dated Status section, relabel/regroup headings
under Decision, and add a Consequences section assembled from each ADR's
already-stated implications. No decision substance changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
## Final verification (after all tasks)
- [ ] **Lint:** `make lint` — Expected: passes (docs + a stdlib script touched; ansible content unchanged).
- [ ] **Full deterministic scan clean for our check:** `python3 scripts/repo-scan.py 2>/dev/null | python3 -c "import json,sys; print('adr-structure:', sum(1 for f in json.load(sys.stdin)['findings'] if f['check']=='adr-structure'))"` → `adr-structure: 0`.
- [ ] **Tests green:** `.venv/bin/pytest tests/ -q` → all pass.
- [ ] **Branch ready:** invoke `superpowers:finishing-a-development-branch` to merge `feat/adr-structure` to `main` (trunk-based, no PR) and delete the branch.
---
## Self-review notes
- **Spec coverage:** §1 title/filename → Task 3 + template; §2 sections → Tasks 2/3 + check; §3 lifecycle → Task 3; §4 cross-refs → Task 3 `## Related`; §5 template → Task 2; §6 retroactive restructure → Task 5; §7 enforcement → Task 1 + Task 4. All covered.
- **Order nuance:** spec says sections come "in this order"; the check enforces presence + Status only. This is intentional and stated in both the spec's enforcement wording ("the four mandatory sections and a parseable Status line") and ADR-023's Decision §5 / "What was ruled out". Not a gap.
- **Type/name consistency:** `adr_structure_findings` and the `"adr-structure"` check key are used identically in the function, the `scan()` wiring, the tests, and both verification one-liners.


@ -0,0 +1,476 @@
# Backup & DR Strategy — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Land the *foundation layer* of the backup strategy — ADR-022, the per-service `backup__*` data contract + `BACKUP.md` governance triad (template + checklist gate + runbook step + dormant verifier), and the doc/inventory updates — so every future service role is born backup-aware, before any live infrastructure exists.
**Architecture:** This is the first of three sequenced plans (see *Decomposition & roadmap* below). It is **doc/governance only** — no Ansible role, no live restic/rclone, no host contact. It mirrors exactly how ADR-021 delivered operational-access governance: a template under `docs/<concern>/`, one line in `docs/security/service-checklist.md`, a step in `docs/runbooks/new-role.md`, and a *dormant* verifier command (`/check-access` → here `/check-backup`). boma deliberately gates these per-service docs via checklist+runbook, **not** an automated lint script — so this plan adds **no** `scripts/check-*.py`. (This reconciles the design doc's casual "make lint gates its presence" phrasing with boma's actual governance choice; the ADR records the reconciliation.)
**Tech Stack:** Markdown docs, Ansible role-var conventions (`backup__*`, double-underscore namespace per CLAUDE.md), `make lint` (yamllint + ansible-lint + `check-tags.py`) as the only automated gate, `git` trunk-based on a feature branch.
**Source spec:** `docs/superpowers/specs/2026-06-10-backup-strategy-design.md` (Decisions 1–13 referenced by number throughout).
---
## Decomposition & roadmap
The full spec spans three subsystems with hard ordering dependencies (STATUS.md: no service roles exist, `fisi` unprovisioned, Terraform never `init`ed, no staging cluster, no Uptime Kuma/pCloud). Each becomes its own plan and produces working, testable software on its own:
- **Plan 1 — Foundation (THIS PLAN).** ADR + `backup__*` contract + `BACKUP.md` governance + doc/inventory updates. Buildable and verifiable **today** with zero live infra. Unblocks every service role.
- **Plan 2 — The `backup` role (FUTURE).** `make new-role NAME=backup`: pull orchestrator, restic wrapper, `rclone→pCloud`, retention prune, udev air-gap unit + `restic copy`, systemd timers, ntfy + Uptime-Kuma heartbeat. Built with Molecule render/syntax tests + pytest, the way the `firewall` concern was — buildable now, *functionally* testable only once `fisi` + hosts exist. **Blocked on:** `fisi` provisioned (SATA power cable), `backup_hosts` inventory group, at least one service role declaring `backup__*`.
- **Plan 3 — Live wire-up + restore testing (FUTURE).** Deploy the role, pCloud rclone auth, Uptime Kuma push monitor, Tier-1 restore-verify on `ubongo`, semi-annual Tier-2 DR rehearsal on staging, the printed break-glass runbook + its annual drill. **Blocked on:** Plan 2 deployed, real VMs/staging, services with `VERIFY.md`, Vaultwarden live.
Write Plans 2 and 3 with this same skill when their prerequisites land. Everything below is Plan 1.
---
## Plan 1 file map
| File | Action | Responsibility |
|---|---|---|
| `docs/decisions/022-backup.md` | create | ADR of record; distils the spec's Decisions 1–13 |
| `docs/backup/service-backup-template.md` | create | `BACKUP.md` template; defines the `backup__*` contract shape |
| `.claude/commands/check-backup.md` | create | Dormant verifier (mirrors `check-access.md`) |
| `CLAUDE.md` | modify | Role-conventions: BACKUP.md required for service roles; Further-reading row |
| `docs/security/service-checklist.md` | modify | Strengthen the Operability backup line to the ADR-022 gate |
| `docs/runbooks/new-role.md` | modify | Add the per-service BACKUP.md step (new §12, renumber commit) |
| `docs/hardware/reference.md` | modify | `ubongo` → M70q/1TB; add `fisi` node + capacity row |
| `docs/CAPABILITIES.md` | modify | §9: restic+rclone+USB committed; PBS deferred; ref ADR-022 |
| `STATUS.md` | modify | Add "Designed but not built" rows for backup role + contract |
| `docs/TODO.md` | modify | Mark item 3.8 decided; reference ADR-022 |
**Working branch (all tasks):** AI-driven multi-file change → review as one diff (CLAUDE.md git conventions).
```bash
git checkout -b feat/backup-foundation
```
Before any commit, confirm `rbw unlocked` exits 0 (the pre-commit hook decrypts `vault.yml`); if not, stop and ask the operator to `rbw unlock`.
---
### Task 1: Author ADR-022 and wire the decision into CLAUDE.md / STATUS.md / TODO.md
**Files:**
- Create: `docs/decisions/022-backup.md`
- Modify: `CLAUDE.md` (Further-reading table; role-conventions block)
- Modify: `STATUS.md` ("Designed but not built" table)
- Modify: `docs/TODO.md` (item 3.8)
- [ ] **Step 1: Write `docs/decisions/022-backup.md`**
Mirror the structure of `docs/decisions/021-operational-access.md` (`## Context`, `## Decision`, subsections, `## Consequences`). Transcribe the spec's settled decisions — do not re-derive. The ADR body must state, each as its own labelled decision:
1. **Recovery model A** — data-only restic backups, rebuild-from-code; no PBS in v1 (deferred as Model B/C). (spec Decision 1)
2. **One tier, ~24 h RPO.** (Decision 2)
3. **Engine:** restic (data) + rclone (pCloud off-site); restic encrypts → rclone moves ciphertext only, no second layer. (Decision 3)
4. **Topology:** central off-cluster **pull** node (`fisi`, provisional), 2×8 TB mirror, owns the repo, runs rclone + the USB dock; hosts hold no backup creds. New `backup_hosts` inventory group, `base` role applies. (Decision 4)
5. **3-2-1 mapping** incl. USB air-gap as the immutable backstop. (Decision 5)
6. **Per-service contract:** `backup__*` role vars + required `BACKUP.md`, rendered from the data (the ADR-021 pattern). **Governance reconciliation:** gated via the per-service checklist + new-role runbook + dormant `/check-backup` verifier — **not** an automated lint script (consistent with ADR-021's "runbook+gate, not scaffold" choice). State this explicitly so it supersedes the design doc's "make lint gates its presence" wording. (Decision 6)
7. **Consistency:** logical dumps first (`pg_dump`/`mysqldump`), `quiesce` escape hatch; FS snapshots not the sole DB method. (Decision 7)
8. **Restore testing:** Tier-1 weekly rolling container restore-verify on `ubongo` (reuses `VERIFY.md`); Tier-2 semi-annual full DR rehearsal on staging, ≥1/yr exercises the paper break-glass. `ubongo` stays bare Debian, not a hypervisor (ADR-015 unchanged). (Decision 8)
9. **Retention (GFS):** `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`. (Decision 9)
10. **Encryption + escrow + break-glass:** one restic password protects all copies; escrowed to `fisi`(+vault) / Vaultwarden / **paper**; paper holds **both** the restic password **and** the Ansible vault password (breaks the Model-A circular dependency); `mamba` is the break-glass clone (ADR-015). (Decision 10)
11. **USB air-gap:** udev serial-allowlist → `restic copy` to a USB restic repo → `restic check` → ntfy; rotate off-site. (Decision 11)
12. **Failure alerting:** Uptime-Kuma dead-man's-switch + ntfy on failure + weekly `restic check`. (Decision 12)
13. **Schedule.** (Decision 13)
`## Consequences` must note: pCloud is off-site but **sync-coupled** (deletes propagate) → USB is the only immutable copy; `fisi` is the crown-jewel host (full base hardening); pCloud's 1 TB is the off-site capacity ceiling. End with a one-line pointer back to the design doc and to Plans 2–3 as the build path.
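To make the retention line in item 9 concrete, here is a simplified illustration of what a grandfather-father-son policy with those four limits keeps. It is a sketch for intuition only, not restic's implementation. restic's own `forget` behaviour is authoritative, and `gfs_keep` is a hypothetical name.

```python
from datetime import datetime, timedelta


def gfs_keep(snapshots, daily=7, weekly=4, monthly=6, yearly=1):
    """Return the snapshot times a simplified GFS policy would keep."""
    rules = [
        (daily, lambda t: (t.year, t.month, t.day)),
        (weekly, lambda t: tuple(t.isocalendar()[:2])),  # (ISO year, ISO week)
        (monthly, lambda t: (t.year, t.month)),
        (yearly, lambda t: t.year),
    ]
    newest_first = sorted(snapshots, reverse=True)
    keep = set()
    for limit, bucket_of in rules:
        seen = set()
        for taken in newest_first:
            bucket = bucket_of(taken)
            if bucket in seen:
                continue  # a newer snapshot already represents this bucket
            if len(seen) == limit:
                break  # this rule's quota is used up
            seen.add(bucket)
            keep.add(taken)
    return keep


if __name__ == "__main__":
    # One hypothetical snapshot per night for 400 nights.
    nightly = [datetime(2026, 6, 30, 2, 0) - timedelta(days=i) for i in range(400)]
    print(len(gfs_keep(nightly)), "of", len(nightly), "snapshots kept")
```

Each rule keeps the newest snapshot in each of its most recent buckets, and one snapshot can satisfy several rules at once, so the total kept is at most 7 + 4 + 6 + 1 = 18.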
- [ ] **Step 2: Add the Further-reading row in `CLAUDE.md`**
In the Further-reading table, immediately after the `Operational access … 021-operational-access.md` row, add:
```
| Backup & disaster recovery | `docs/decisions/022-backup.md` |
```
- [ ] **Step 3: Add the BACKUP.md role-convention in `CLAUDE.md`**
In the "Role conventions" list, immediately after the `ACCESS.md (ADR-021)` bullet, add:
```
- Every **service** role that holds state must have a populated `BACKUP.md` (ADR-022) —
copy `docs/backup/service-backup-template.md`; rendered from the role's `backup__*`
data. A stateless service records `backup__state: false` with a reason.
```
- [ ] **Step 4: Add STATUS.md rows**
In the "Designed but not built" table in `STATUS.md`, add two rows:
```
| Backup `backup` role + `backup_hosts` group | ADR-022 | Does not exist. Pull node (`fisi`), restic repo, rclone→pCloud, USB air-gap — Plan 2. |
| Per-service `backup__*` contract + `BACKUP.md` | ADR-022 | Convention defined; inert until service roles exist to declare against. |
```
- [ ] **Step 5: Update TODO item 3.8**
In `docs/TODO.md`, change the item-3.8 line:
From:
```
8. Ensure the right things are backed up (incl. database dumps if we land on PBS).
```
To:
```
8. ~~Ensure the right things are backed up (incl. database dumps if we land on PBS).~~
DECIDED (ADR-022): data-only restic (Model A, no PBS) pulled by an off-cluster
node (`fisi`); per-service `backup__*` + `BACKUP.md`; logical DB dumps; 3-2-1 via
pCloud + rotated USB air-gap. Build: Plans 2–3.
```
- [ ] **Step 6: Verify**
Run: `make lint`
Expected: PASS (yamllint, ansible-lint, `check-tags: OK …`). No new YAML/tags introduced, so this confirms nothing regressed.
Run: `grep -n "022-backup" CLAUDE.md && grep -rn "ADR-022" docs/decisions/022-backup.md STATUS.md docs/TODO.md`
Expected: matches in every listed file (cross-references resolve).
- [ ] **Step 7: Commit**
```bash
git add docs/decisions/022-backup.md CLAUDE.md STATUS.md docs/TODO.md
git commit -m "docs(backup): record ADR-022; wire into CLAUDE.md, STATUS, TODO"
```
---
### Task 2: Create the `BACKUP.md` template and define the `backup__*` contract
**Files:**
- Create: `docs/backup/service-backup-template.md`
- [ ] **Step 1: Create the template**
Mirror `docs/access/service-access-template.md` (preamble that says copy-to-role-and-delete; structured tables rendered from data; a hand-written prose tail). Write exactly:
````markdown
# Per-service backup record — template
Copy this file to `roles/<service>/BACKUP.md` when building a **stateful** service
role (ADR-022). It is the per-service **backup record**: what state the service holds,
how it is captured consistently, and how it is restored. The structured parts are
**rendered from the role's `backup__*` data** (the single source of truth that also
drives `/check-backup`) — keep the data authoritative and regenerate this file rather
than hand-editing the tables. The prose "Restore notes" tail is hand-written.
A **stateless** service (holds no persistent data) does not get a `BACKUP.md`; it sets
`backup__state: false` with a reason in its role defaults instead.
Delete this preamble in the copy and start from the heading below.
---
# Backup — <service>
## State captured
Rendered from `backup__*`:
| What | Source | How captured |
|---|---|---|
| data dir(s) | `<backup__paths[*]>` | file-level, pulled read-only |
| database | `<backup__dumps[*].cmd>` → `<backup__dumps[*].dest>` | logical dump (default; ADR-022 Decision 7) |
- **Quiesce:** `<backup__quiesce>`; `true` means the service is stopped → backed up →
restarted (escape hatch for data that cannot be dumped live; ADR-022 Decision 7 B).
- **RPO:** ~24 h (nightly; ADR-022 Decision 2).
## Restore procedure
1. Re-provision the host (Terraform) and redeploy this role (Ansible) — Model A.
2. `restic restore` the latest snapshot for `<backup__service>` into `<backup__paths>`.
3. Replay each `<backup__dumps[*].dest>` into its database.
4. Confirm with this role's `VERIFY.md` checks (ADR-008/017).
## Restore notes
Prose the data can't capture — ordering gotchas, "restore the DB before the data dir",
known-tricky migrations.
- <none yet>
````
The `backup__*` contract this template renders from (document it here and in the ADR; the role in Plan 2 consumes it):
```yaml
backup__service: <name> # identifier; matches the role / compose project
backup__state: true # false = stateless → no BACKUP.md (pair with a reason)
backup__paths: # bind-mount dirs/files holding state ([] = none)
  - /srv/<service>/data
backup__dumps: # logical app-consistent dumps (Decision 7 default; [] = none)
  - cmd: "docker compose -p <service> exec -T db pg_dump -U {{ vault.<service>.db_user }} <db>"
    dest: <service>-db.sql
backup__quiesce: false # true = stop→back up→restart escape hatch (Decision 7 B)
```
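A contract like the one above lends itself to a shape check before anything consumes it. The sketch below works under stated assumptions: `validate_backup_contract` is a hypothetical helper, `backup__state_reason` is an invented key (the plan only says "with a reason"), and the rule that a stateful service must declare at least one path or dump is an inference, not something the plan states.

```python
def validate_backup_contract(role_vars: dict) -> list[str]:
    """Return a list of problems; an empty list means the contract looks well-formed."""
    problems = []
    if not role_vars.get("backup__service"):
        problems.append("backup__service is required")
    if role_vars.get("backup__state", True) is False:
        # `backup__state_reason` is a hypothetical key name for the required reason.
        if not role_vars.get("backup__state_reason"):
            problems.append("stateless service must record a reason")
        return problems
    paths = role_vars.get("backup__paths", [])
    dumps = role_vars.get("backup__dumps", [])
    if not paths and not dumps:
        # Assumption: a stateful service has something to capture.
        problems.append("stateful service declares no backup__paths or backup__dumps")
    for i, dump in enumerate(dumps):
        for key in ("cmd", "dest"):
            if not dump.get(key):
                problems.append(f"backup__dumps[{i}] is missing '{key}'")
    if not isinstance(role_vars.get("backup__quiesce", False), bool):
        problems.append("backup__quiesce must be a boolean")
    return problems
```

Returning a list of messages, not raising on the first error, lets a caller report every gap in one pass.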
- [ ] **Step 2: Verify**
Run: `test -f docs/backup/service-backup-template.md && echo PRESENT`
Expected: `PRESENT`
Run: `make lint`
Expected: PASS (markdown only; confirms no regression).
- [ ] **Step 3: Commit**
```bash
git add docs/backup/service-backup-template.md
git commit -m "docs(backup): add BACKUP.md template + backup__* contract (ADR-022)"
```
---
### Task 3: Strengthen the per-service checklist gate
**Files:**
- Modify: `docs/security/service-checklist.md` (Operability section)
- [ ] **Step 1: Replace the weak backup line with the ADR-022 gate**
In the "Operability (security-adjacent)" section, replace this line:
```
- [ ] Backup/restore is covered if the service holds state
```
with (mirroring the existing ADR-021 access line directly below it):
```
- [ ] Backup/restore recorded and verifiable (ADR-022): a stateful service carries
`backup__*` data, `roles/<service>/BACKUP.md` is rendered, and `/check-backup`
reports the declared paths/dumps captured in the latest snapshot — or the service
sets `backup__state: false` with a reason. Deviations → `docs/security/accepted-risks.md`.
```
- [ ] **Step 2: Verify**
Run: `grep -n "ADR-022" docs/security/service-checklist.md`
Expected: one match (the new gate line).
Run: `grep -c "Backup/restore is covered if the service holds state" docs/security/service-checklist.md`
Expected: `0` (old weak line gone).
- [ ] **Step 3: Commit**
```bash
git add docs/security/service-checklist.md
git commit -m "docs(backup): gate BACKUP.md in service checklist (ADR-022)"
```
---
### Task 4: Add the BACKUP.md step to the new-role runbook
**Files:**
- Modify: `docs/runbooks/new-role.md` (insert a new step after the §11 ACCESS step; renumber the commit step)
- [ ] **Step 1: Insert the new step**
Immediately after the §11 "Write the per-service operational-access record" block and before "### 12. Commit", insert:
```markdown
### 12. Write the per-service backup record (stateful services)
For a **stateful** service role, copy `docs/backup/service-backup-template.md` to
`roles/<rolename>/BACKUP.md` and populate the role's `backup__*` data (`backup__service`,
`backup__paths`, `backup__dumps` (`cmd` + `dest` per logical dump) and `backup__quiesce`;
ADR-022). Prefer logical dumps (`pg_dump`/`mysqldump`) over file-level DB copies. `BACKUP.md`
is rendered from that data. A **stateless** service sets `backup__state: false` with a
reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rolename>`
proves the declared state is captured — part of the service-clearance gate
(`docs/security/service-checklist.md`).
```
- [ ] **Step 2: Renumber the commit step**
Change the heading `### 12. Commit` (now the following heading) to `### 13. Commit`.
- [ ] **Step 3: Verify**
Run: `grep -nE "^### (11|12|13)\." docs/runbooks/new-role.md`
Expected: §11 access, §12 backup, §13 commit — in that order, no duplicate numbers.
- [ ] **Step 4: Commit**
```bash
git add docs/runbooks/new-role.md
git commit -m "docs(backup): add BACKUP.md step to new-role runbook (ADR-022)"
```
---
### Task 5: Create the dormant `/check-backup` verifier command
**Files:**
- Create: `.claude/commands/check-backup.md`
- [ ] **Step 1: Write the command**
Mirror the sibling `.claude/commands/check-access.md` (same frontmatter/sections, same "dormant until infra exists" framing). Write:
````markdown
---
description: Backup-coverage verification (ADR-022) — proves a service's declared backup state is actually captured.
---
Verify that a service's **declared** backup data (`backup__*`) is actually captured in
the backup repo, so the verifier and `BACKUP.md` can never disagree (the ADR-021 pattern,
applied to backups). Argument: a service/role name (e.g. `/check-backup nextcloud`).
**Dormant until the backup node exists** (Plan 2/3): with no `fisi` repo to query, this
command reports `not-yet-available` rather than failing.
## Preconditions
- `roles/<name>/` carries `backup__*` data (or `backup__state: false` with a reason).
- The backup node (`fisi`) is reachable and its restic repo exists. If not → report
`not-yet-available` and stop.
## Checks (when live)
Load the `backup__*` data for the resolved role, then:
| Check | How | Green when |
|---|---|---|
| snapshot freshness | `restic snapshots --tag <backup__service> --latest 1` | a snapshot ≤ ~24 h old exists |
| paths present | the latest snapshot contains every `backup__paths` entry | all declared paths present |
| dumps present | the snapshot contains every `backup__dumps[*].dest` | all declared dumps present |
| integrity | `restic check --read-data-subset=<subset>` (sampled) | no errors |
Report per-check pass/fail; a stateless role (`backup__state: false`) reports `n/a (stateless)`.
````
- [ ] **Step 2: Verify**
Run: `test -f .claude/commands/check-backup.md && head -1 .claude/commands/check-backup.md`
Expected: file present, first line `---` (valid frontmatter).
Run: `grep -n "not-yet-available" .claude/commands/check-backup.md`
Expected: matches (dormancy explicit).
- [ ] **Step 3: Commit**
```bash
git add .claude/commands/check-backup.md
git commit -m "feat(backup): add dormant /check-backup verifier (ADR-022)"
```
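Once the backup node exists, the freshness and presence checks in the table above could be evaluated along these lines. This is a minimal sketch, not the verifier itself: it assumes the input is the JSON array printed by `restic snapshots --json` (each snapshot carrying an RFC 3339 `time` field), and the helper names `parse_restic_time`, `latest_snapshot_is_fresh`, and `missing_entries` are hypothetical.

```python
import json
import re
from datetime import datetime, timedelta, timezone


def parse_restic_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, normalising the fraction to microseconds."""
    value = value.replace("Z", "+00:00")
    # datetime.fromisoformat is strict about fraction length on older Pythons,
    # so truncate or pad the fractional seconds to exactly six digits.
    value = re.sub(r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), value)
    return datetime.fromisoformat(value)


def latest_snapshot_is_fresh(snapshots_json: str, now: datetime, max_age_hours: int = 24) -> bool:
    """True when the newest snapshot is at most max_age_hours old."""
    snapshots = json.loads(snapshots_json)
    if not snapshots:
        return False  # no snapshot at all can never be "fresh"
    newest = max(parse_restic_time(snap["time"]) for snap in snapshots)
    return now - newest <= timedelta(hours=max_age_hours)


def missing_entries(declared: list[str], captured: list[str]) -> list[str]:
    """Declared paths or dump destinations that the snapshot does not contain."""
    captured_set = set(captured)
    return [entry for entry in declared if entry not in captured_set]
```

A check is green when `latest_snapshot_is_fresh(...)` is true and `missing_entries(...)` is empty for both `backup__paths` and the `backup__dumps[*].dest` values; how the captured file list is obtained from the snapshot is left open here.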
---
### Task 6: Update hardware reference and capabilities
**Files:**
- Modify: `docs/hardware/reference.md` (`ubongo` spec; new `fisi` node; capacity table)
- Modify: `docs/CAPABILITIES.md` (§9 Data & backup)
- [ ] **Step 1: Update the `ubongo` prose block**
In `docs/hardware/reference.md` §1, replace the `ubongo` Storage line target with the real machine:
From:
```
- **Storage:** _TBD (target 250 GB SSD/NVMe)_
```
To:
```
- **Storage:** 1 TB NVMe (ThinkCentre M70q Tiny; i3-10100T, 16 GB) — over-spec for Tier-1 restore-verify (ADR-022)
```
- [ ] **Step 2: Add a `fisi` prose block**
After the `ubongo` block in §1, add:
```
### fisi (backup node — outside the cluster; provisional)
- **Model / form factor:** HP Elite 600 G9 (tower)
- **CPU:** i-series (12th-gen), x86-64 — featherweight for a data-only restic node
- **RAM:** 16 GB+ (TBD exact)
- **Storage:** OS NVMe + **2× 8 TB HDD in a mirror** (ZFS/mdraid → 8 TB usable, survives one disk)
- **NICs:** wired GbE
- **Notes:** off-cluster pull backup node (ADR-022); owns the restic repo, runs rclone→pCloud,
docks the rotated USB air-gap drives. **Pending:** SATA power cable to the HDDs.
Crown-jewel host → full `base` hardening. Assignment provisional (revisit when all hardware on hand).
```
- [ ] **Step 3: Update the machine-readable capacity table**
In §4 "Node capacity", change the `ubongo` row disk from `250` to `1000` and add a `fisi` row. Keep the header and integer/decimal format intact (parsed by `capacity-scan.py`):
From:
```
| ubongo | 4 | 16 | 250 |
```
To:
```
| ubongo | 4 | 16 | 1000 |
| fisi | 4 | 16 | 8000 |
```
- [ ] **Step 4: Update CAPABILITIES §9**
In `docs/CAPABILITIES.md` §9 table, replace the three backup rows:
From:
```
| Backup engine | Proxmox Backup Server · restic | P | planned | VM backups (PBS) + file/DB dumps (restic) | TODO 3.8 |
| Off-site target | pCloud | S | planned | Off-site copy of backups (3-2-1) | |
| Air-gap target | USB hard drives | S | maybe-later | Periodic cold/air-gapped copy | Manual rotation |
```
To:
```
| Backup engine | restic (data-only) | S | committed | Per-service state: file dirs + logical DB dumps, pulled by `fisi` | ADR-022 (PBS deferred) |
| Off-site target | pCloud (via rclone) | S | committed | Encrypted off-site copy of the restic repo (3-2-1) | ADR-022; sync-coupled |
| Air-gap target | USB hard drives | S | committed | Rotated offline cold copy — the immutable backstop | ADR-022; udev-triggered `restic copy` |
```
- [ ] **Step 5: Verify**
Run: `make lint`
Expected: PASS.
Run: `python3 scripts/capacity-scan.py >/dev/null && echo CAPACITY_OK`
Expected: `CAPACITY_OK` (the capacity table headers are still parseable; new `fisi` row accepted).
Run: `grep -n "ADR-022" docs/CAPABILITIES.md`
Expected: three matches (the updated backup rows).
- [ ] **Step 6: Commit**
```bash
git add docs/hardware/reference.md docs/CAPABILITIES.md
git commit -m "docs(backup): update hardware ref (ubongo M70q, add fisi) + CAPABILITIES §9 (ADR-022)"
```
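Step 3 asks for the integer/decimal format of the capacity rows to stay intact because a script parses them. The sketch below shows why a stray unit would break such a parser. It is an illustration only, not the real `capacity-scan.py`: the function name `parse_capacity_row` and the positional reading of the columns are assumptions.

```python
def parse_capacity_row(line: str) -> tuple[str, list[float]]:
    """Split one Markdown table row into a node name and its numeric cells."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    name, *values = cells
    # float() accepts both "16" and "16.5"; anything else raises ValueError.
    return name, [float(value) for value in values]
```

A row such as `| fisi | 4 | 16 | 8000 |` parses cleanly, whereas writing the disk as `8 TB` would raise a `ValueError`.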
---
### Task 7: Final review and merge
- [ ] **Step 1: Full lint + capacity sanity**
Run: `make lint && python3 scripts/capacity-scan.py >/dev/null && echo ALL_GREEN`
Expected: `ALL_GREEN`.
- [ ] **Step 2: Cross-reference audit**
Run: `grep -rln "ADR-022\|022-backup" CLAUDE.md STATUS.md docs/ .claude/`
Expected: ADR file, CLAUDE.md, STATUS.md, TODO.md, service-checklist.md, new-role.md, CAPABILITIES.md, check-backup.md all listed — no dangling reference, no file missed.
- [ ] **Step 3: Merge to main and delete the branch**
```bash
git checkout main
git merge --no-ff feat/backup-foundation -m "feat(backup): backup strategy foundation layer (ADR-022)"
git branch -d feat/backup-foundation
git push origin main
```
---
## Self-review (completed by plan author)
- **Spec coverage:** All 13 decisions are recorded in ADR-022 (Task 1, Step 1). The *foundation* obligations of Decisions 6 (contract + BACKUP.md), 7 (dumps-first wording in template/runbook), and the doc/inventory facts (Decisions 4/8 hardware) are implemented as concrete files in Tasks 2–6. Decisions whose *implementation* is live infra, namely 1/3/9/11/12/13 (engine, retention, air-gap mechanism, alerting, schedule) and 8's restore-testing, are explicitly deferred to Plans 2–3 (see *Decomposition & roadmap*), not silently dropped.
- **Placeholder scan:** No "TBD/implement later" steps; every edit shows exact from→to text or full file content. (`<service>`/`<name>` inside template/contract bodies are intentional doc placeholders for the eventual role author, not plan gaps.)
- **Consistency:** `backup__*` field names (`backup__service`, `backup__state`, `backup__paths`, `backup__dumps[].cmd/.dest`, `backup__quiesce`) are identical across the ADR (Task 1), template + contract (Task 2), checklist (Task 3), runbook (Task 4), and `/check-backup` (Task 5). The governance set matches ADR-021's (template / checklist line / runbook step / dormant verifier), and the "no lint script" choice is stated in both the plan header and the ADR.


@ -0,0 +1,58 @@
# `dev_env` Role — Implementation Plan (iteration 1)
> Built in the same 2026-06-11 session as the `ubongo` bring-up. A developer
> interactive environment (zsh/tmux/nvim) for **workstation-class** hosts.
**Goal:** Give `ubongo` (and future `mamba`) a clean interactive shell/editor setup,
reproducibly, as a boma-native Ansible role — so the operator (and the `claude` agent
user) can work comfortably over SSH.
## Decisions
- **Separate role, never part of `base`.** `base` is the security/infra baseline for
*every* host; a dev environment is only for human workstation-class hosts. Servers and
service VMs must never get it.
- **Stow, not templating.** Dotfiles are **real files** under `files/dotfiles/{zsh,tmux,nvim}/`
(re-derived `$HOME`-relative from `fisi`'s live configs), symlinked into `~` with GNU
stow. No Jinja-templated dotfiles (they rot; you'd edit templates not configs).
- **Users:** `dev_env__users` (default `[]`). Set to `[sjat, claude]` for `ubongo` in
`group_vars/control`.
- **V4 (ADR-013):** configs/package-lists/install-mechanism *consulted* from V4 and
**re-derived on boma's terms** — not its structure. V4 identifiers stripped from the
dotfiles.
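To make the "stow, not templating" decision concrete: stowing means the files in `~` are symlinks back to the real files in the package directory, so an edit in either place is the same edit. The sketch below is a deliberately simplified model of that idea, not how GNU stow is implemented (stow also folds directories and handles unstowing); `stow_package` is a hypothetical helper.

```python
from pathlib import Path


def stow_package(package_dir: Path, target_dir: Path) -> list[Path]:
    """Symlink every file under package_dir into target_dir, keeping relative paths."""
    created = []
    for source in sorted(package_dir.rglob("*")):
        if source.is_dir():
            continue
        link = target_dir / source.relative_to(package_dir)
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.is_symlink() and link.resolve() == source.resolve():
            continue  # already stowed, so a second run changes nothing
        if link.exists() or link.is_symlink():
            raise FileExistsError(f"refusing to overwrite {link}")
        link.symlink_to(source)
        created.append(link)
    return created
```

Running it twice creates nothing the second time, which is the property the Molecule verify step relies on when it asserts that the dotfiles are symlinks.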
## Re-derivations vs V4
- **No Nerd Font** on `ubongo` — it's headless; fonts are a client-side concern.
- **No system-wide LSP suite** — the operator's nvim uses **mason**, which self-installs
LSPs/formatters inside nvim (needs only nvim + git + a C compiler + node).
- **Pinned versions** (ADR-014): nvim `v0.12.2`, oh-my-posh `29.0.1` (V4 tracks "latest").
- **Plugins self-bootstrap**: lazy.nvim installs nvim plugins on first launch; the role
only lays down config + pre-clones omz/tmux plugins.
## Tasks (role: `roles/dev_env/`)
- `tasks/main.yml` — apt packages (`packages` tag) → include `neovim.yml`, `oh_my_posh.yml`
→ loop `per_user.yml` over `dev_env__users`.
- `tasks/neovim.yml` — install pinned nvim release to `/opt`, symlink, version sentinel.
- `tasks/oh_my_posh.yml` — install pinned oh-my-posh binary + deploy `zen.toml` to `/etc`.
- `tasks/per_user.yml` — set login shell to zsh (`users`); clone oh-my-zsh + custom
plugins + tmux/TPM plugins; copy dotfiles to `~/.dotfiles`; `stow` into `~` (`config`).
- `defaults/main.yml`, `meta/main.yml`, `README.md`, `requirements.yml`.
- `molecule/default/{converge,verify}.yml` — create a `tester` user, apply, assert
packages + nvim/omp/zen present + shell=zsh + dotfiles stowed (symlinks).
- `playbooks/workstation.yml` — apply `dev_env` to the `control` group (ubongo).
- `inventories/production/group_vars/control/vars.yml`: `dev_env__users: [sjat, claude]`.
## Verify / apply
- `make lint`; `make test ROLE=dev_env` (Molecule, Debian 13) must pass.
- Apply to `ubongo`: `make check`/`deploy PLAYBOOK=workstation` from a host that can SSH
to `ubongo` as `sjat` with `--ask-become-pass` (the Ansible-manages-ubongo connection
isn't bootstrapped yet — handle at apply time).
## Deferred (iteration 2+)
- A proper `workstations` inventory group (when `mamba` joins) instead of reusing `control`.
- lazygit, extra CLI tooling, any system LSP/formatters mason can't cover.
- Pinning tmux plugins to commits (currently `master` except catppuccin `v1.0.3`).


@ -0,0 +1,150 @@
# Ubongo Physical Build — Implementation Plan
> **For agentic workers:** Execute task-by-task. This is the **physical bring-up** of
> `ubongo`. The 2026-06-05 plan (`2026-06-05-ubongo-control-host.md`) was
> *documentation-only* (it authored ADR-015); this is its sequel — taking the actual
> box from bare Debian 13 to a working control / AI-worker node.
**Goal:** Bring the Lenovo ThinkCentre M70q from a fresh Debian 13 install to a working
control node: toolchain, dedicated `claude` identity, repo + Claude Code, vault access,
inventory wiring, keys-only SSH, and reconciliation of the docs to "built."
**Spec / decisions of record:** ADR-015 + `docs/superpowers/specs/2026-06-05-ubongo-control-host-design.md`,
plus the interactive build decisions captured below (2026-06-11 session).
---
## Decisions made this session (2026-06-11)
- **Hardware:** Lenovo ThinkCentre M70q Tiny · i3-10100T (4c/8t) · 16 GB · 256 GB
SanDisk X600 SATA SSD (TCG **Opal**-capable; Opal **unused**, see encryption).
- **BIOS:** auto-power-on after loss; Wake-on-LAN on; ErP/deep-S5 off; **supervisor
password set**; external/USB + PXE boot **disabled**; Secure Boot on; TPM (PTT) on;
VT-x/VT-d on; Better-Thermal cooling.
- **Disk encryption: NONE.** Accepted risk — compensated by physical security + BIOS
supervisor password + disabled external boot. Recorded in `accepted-risks.md` (Task H1).
- **Partitioning:** simple single ext4 root (`/dev/sda2`, 221 G) + 12 G swap, no LVM.
Revisit via reinstall onto LVM/bigger drive only if the layout bites.
- **Identity:** dedicated **`claude`** user — for **attribution + revocation, not
containment**. In the `docker` group (Molecule); **no local sudo** (boma deploys run
over SSH as `ansible`; the agent needs Docker, not root). Reached via `sudo -iu claude`
from `sjat`. Own `ed25519` key for Forgejo. ADR-021 leaves this identity open — note it.
- **Access:** LAN SSH only for now — the NetBird mesh (ADR-016) is deferred (`askari` +
service machinery unbuilt). Keys-only enforced after bootstrap.
- **Address:** `10.20.10.151/24` on `eno1`. Make stable via an OPNsense DHCP reservation.
**Pinned versions (match `fisi`):** docker 29.5.2 · rbw 1.15.0 · node 20.19.2 ·
claude 2.1.173. Terraform is absent on `fisi` (TF un-init'd) — install deferred.
---
## Pre-flight
- **Temp passwordless sudo** for `sjat` during the build (`/etc/sudoers.d/99-boma-build`);
**removed in Task F2**. Without it, non-interactive SSH `sudo` hangs.
- **`rbw unlock`** on `fisi` before any commit (pre-commit decrypts `vault.yml`).
- **Commit style:** one commit per logical unit; imperative subject ≤72 chars; trailer
`Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>`.
- Drive the live box (`ubongo`) directly over SSH; do repo/doc tasks (H) as clean commits.
---
## Stage A — Toolchain (on `ubongo`, via `sjat` sudo)
- [ ] **A1.** apt base: `git make build-essential python3-venv python3-pip curl
ca-certificates gnupg jq` (+ `apt update`).
- [ ] **A2.** Docker Engine from Docker's official apt repo (Debian 13/trixie); enable +
start; confirm `docker --version` ≈ 29.5.2.
- [ ] **A3.** `rbw` 1.15.0 — try `apt install rbw`; if the version doesn't match, install
the pinned release binary to `/usr/local/bin` (match `fisi`).
- [ ] **A4.** Node 20.19.2 (nodesource or distro) — only if Claude Code needs it; the
native installer bundles its runtime, so Node may be optional.
- [ ] **A5.** Claude Code via the **native installer** (matches `fisi`'s
`~/.local/share/claude/versions/`), installed under the `claude` user in Stage C.
- [ ] Defer Terraform (absent on `fisi`).
## Stage B — Identity (`claude` user)
- [ ] **B1.** `useradd -m -s /bin/bash claude`; lock the password (`passwd -l claude`) —
reached only via `sudo -iu claude` from `sjat` or its own key.
- [ ] **B2.** Add `claude` to the `docker` group.
- [ ] **B3.** No sudo for `claude` (explicit decision). Confirm `sudo -iu claude` works.
## Stage C — Repo + Claude Code (as `claude`)
- [ ] **C1.** Generate `claude`'s `ed25519` key; **[USER]** register the public key in
Forgejo (Settings → SSH keys).
- [ ] **C2.** Clone `ssh://git@forgejo.nyumbani.baobab.band:7577/sjat/boma.git` into
`/home/claude/Projects/boma`.
- [ ] **C3.** `make setup` (venv + `requirements.txt`); `make collections`.
- [ ] **C4.** Install Claude Code (native installer) for `claude`; set up plugins/MCP/
settings per `docs/runbooks/claude-code-setup.md`. Set git `user.name`/`user.email`.
## Stage D — Vault (`rbw`)
- [ ] **D1.** `rbw config set base_url https://vaultwarden.baobab.band`; set email.
- [ ] **D2. [USER]** `rbw login` (master password) on `ubongo`; then `rbw sync`,
`rbw unlock`; verify `rbw get boma-ansible-vault` returns the vault password.
- [ ] **D3.** **Offline-cache verification (ADR-015 open item, security-relevant):**
confirm `rbw` decrypts its local cache with Vaultwarden unreachable. Stamp the result
into ADR-015 / `rotate-secrets.md` (replaces the `TO VERIFY` note).
## Stage E — Inventory + base (partial)
- [ ] **E1.** Add `ubongo` to `inventories/production/hosts.yml` under `control`
(manual exception; note `tf-inventory` will overwrite — re-add after).
- [ ] **E2.** Set `base__firewall_control_addr` to `10.20.10.151` in the appropriate
`group_vars` (the dormant `ssh-from-control` knob, ADR-020/021).
- [ ] **E3.** `make check PLAYBOOK=site` against `control`; apply the built `firewall`
concern only (SSH-hardening/fail2ban/auditd concerns are unbuilt — note the gap).
## Stage F — Hardening / address
- [ ] **F1.** Disable SSH password auth (keys-only) via `/etc/ssh/sshd_config.d/`;
`PermitRootLogin no`; reload `sshd` (we're on a key, so safe).
- [ ] **F2.** **Remove the temp NOPASSWD** drop-in (`/etc/sudoers.d/99-boma-build`).
- [ ] **F3. [USER]** OPNsense DHCP reservation for `10.20.10.151`.
## Stage H — Docs reconciliation (repo commits)
- [ ] **H1.** `accepted-risks.md`: add the plaintext-disk accepted risk (compensations:
physical security, BIOS supervisor password, no external boot).
- [ ] **H2.** `docs/hardware/reference.md`: fill `ubongo`'s real specs (M70q, i3-10100T,
16 GB, 256 GB SanDisk X600) into the TBD skeleton; node-capacity row already present.
- [ ] **H3.** `STATUS.md`: move `ubongo` from "Designed but not built" toward built
(note what's live vs. still pending — mesh, full `base`).
- [ ] **H4.** Note the dedicated-`claude` identity decision (short amendment to ADR-021
or ADR-015) and the LAN address.
---
## Out of scope this session
- **Mesh VPN** (NetBird) — needs `askari` + service roles (ADR-016). SSH stays LAN-only.
- **Full `base` hardening** — SSH/fail2ban/auditd concerns not built (only `firewall`).
- **Recovery wiring (G)** — TF-state backup to `mamba`, rbw mirror — no TF state yet
(TF un-init'd). `mamba` as break-glass clone tracked separately.
---
## Outcome (2026-06-11)
`STATUS.md` is the live source of truth; this is the session record.
**Done:** A (toolchain — Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173; Node deferred),
B (dedicated `claude` user — docker group, no sudo), C (repo cloned, `make setup` +
`collections`, git identity; plugins install on first interactive launch), D (vault via
rbw + **offline-cache decryption verified**), E1/E2 (inventory + `ssh-from-control`
knob), F1 (key-only SSH), F2 (temp NOPASSWD removed), H1–H4 (docs reconciled).
**Deferred, with reason:**
- **E3 — apply `base` to `ubongo`:** would push nftables default-deny with SSH allowed
*only on the mesh interface*, but no mesh exists yet → would deny inbound SSH on `eno1`
and strand the box. Wait for NetBird (ADR-016). `base` is also firewall-concern-only.
- **F3 — OPNsense DHCP reservation** for `10.20.10.151` (MAC `88:a4:c2:e0:ee:da`): operator action.
- **Mesh enrollment, full `base` hardening, recovery wiring (G):** out of scope (above).
**Follow-ups flagged:** (1) `ubongo` sits in `10.20.10.0/24`, which doesn't match
ADR-007's zone map (`srv: 10.20.0.0/24`) — network-design drift to reconcile. (2) The
hardware reference previously assumed `ubongo` had 1 TB NVMe for an ADR-022 "restore-verify"
role; the real disk is 256 GB — check ADR-022 doesn't bank on the larger size.


@ -0,0 +1,538 @@
# askari Provisioning (M2) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Provision `askari` (the off-site Hetzner VPS) as Terraform IaC — a `hetzner_vm` module + an `offsite` stack — behind a TF-managed cloud firewall, hand it into the `offsite_hosts` inventory, and bootstrap it.
**Architecture:** Generalize boma's "Terraform owns VM existence" principle (ADR-006) from Proxmox to Hetzner. A reusable `hetzner_vm` module wraps `hcloud_server` + `hcloud_firewall` + `hcloud_ssh_key`; an `offsite` environment (own local state) declares `askari` (CAX11/ARM, Helsinki, Debian 13). cloud-init creates the `ansible` user with ubongo's key; the firewall allows SSH from ubongo only. Handoff stays ADR-009-shaped: the offsite env outputs `vms`, and `tf_to_inventory.py` (already offsite-aware) generates an inventory file merged via a **directory inventory**.
**Tech Stack:** Terraform (`hetznercloud/hcloud` provider), Hetzner Cloud, cloud-init, Ansible. Token from `vault.hetzner.token` → `TF_VAR_hcloud_token`.
**Spec:** `docs/superpowers/specs/2026-06-14-askari-provisioning-design.md`
**Execution context:** Tasks 1–6 + 9 are authoring + `terraform fmt/validate/plan` (need `terraform` installed + the token, but no resources are created). **Task 7 (`terraform apply`) and Task 8 (bootstrap) create a real, billed VPS**: gated, run with explicit user go, `tf-plan` shown first (CLAUDE.md). If `terraform` is absent in the working env, Tasks 6–8 defer to ubongo.
---
## File Structure
- `terraform/modules/hetzner_vm/{variables,main,outputs}.tf` (create) — wraps server + firewall + ssh key + cloud-init.
- `terraform/environments/offsite/{providers,variables,main,outputs,backend}.tf` + `terraform.tfvars.example` (create) — the askari stack, own local state.
- `Makefile` (modify) — inject `TF_VAR_hcloud_token` for `TF_ENV=offsite`; directory inventory; `tf-inventory-offsite` target.
- `scripts/tf_to_inventory.py` (no change — already offsite-aware) + `tests/test_tf_to_inventory.py` (create) — lock the offsite handoff.
- `docs/decisions/{006,009,020,007,016}-*.md`, `STATUS.md` (modify) — ADR amendments + status.
---
### Task 1: Verify the Hetzner provider/image facts (ADR-014)
**Files:** none (research; pin values used by later tasks).
- [ ] **Step 1: Verify and record**
Verify (WebFetch registry.terraform.io / docs.hetzner.com, or `terraform` once init'd):
- latest `hetznercloud/hcloud` provider version to pin (expected `~> 1.48`+),
- the Debian 13 image slug (expected `debian-13`),
- that server type `cax11` exists in location `hel1`.
Record a stamp in the offsite `providers.tf` comment, e.g.:
`# verified: hetznercloud/hcloud <ver> · debian-13 image · cax11@hel1 · <source> · <date>`
- [ ] **Step 2: No commit** (values land in later tasks).
---
### Task 2: The `hetzner_vm` module
**Files:**
- Create: `terraform/modules/hetzner_vm/variables.tf`, `main.tf`, `outputs.tf`
- [ ] **Step 1: `variables.tf`**
```hcl
variable "name" {
description = "Server name (and hostname)"
type = string
}
variable "server_type" {
description = "Hetzner server type, e.g. cax11 (ARM)"
type = string
}
variable "location" {
description = "Hetzner location, e.g. hel1"
type = string
}
variable "image" {
description = "OS image slug, e.g. debian-13"
type = string
}
variable "ansible_ssh_pubkey" {
description = "Public SSH key provisioned for the ansible user via cloud-init"
type = string
}
variable "ssh_admin_cidrs" {
description = "Source CIDRs allowed to reach SSH (e.g. ubongo's address/32)"
type = list(string)
}
variable "labels" {
description = "Hetzner resource labels (metadata only)"
type = map(string)
default = {}
}
```
- [ ] **Step 2: `main.tf`**
```hcl
# cloud-init: create the unprivileged `ansible` user with ubongo's key + sudo.
# (Mirrors the proxmox_vm module's user_account; Hetzner has no structured field.)
locals {
  user_data = <<-EOT
    #cloud-config
    users:
      - name: ansible
        groups: [sudo]
        sudo: "ALL=(ALL) NOPASSWD:ALL"
        shell: /bin/bash
        ssh_authorized_keys:
          - ${var.ansible_ssh_pubkey}
    package_update: true
    packages:
      - python3
  EOT
}

resource "hcloud_ssh_key" "ansible" {
  name       = "${var.name}-ansible"
  public_key = var.ansible_ssh_pubkey
}

resource "hcloud_firewall" "this" {
  name = "${var.name}-fw"

  # SSH from the control node only (NetBird ports are added in M4 when the
  # coordinator deploys; see ADR-020. The host nftables layer is catalog-driven).
  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = var.ssh_admin_cidrs
  }
}

resource "hcloud_server" "this" {
  name         = var.name
  server_type  = var.server_type
  location     = var.location
  image        = var.image
  ssh_keys     = [hcloud_ssh_key.ansible.id]
  user_data    = local.user_data
  firewall_ids = [hcloud_firewall.this.id]
  labels       = var.labels

  public_net {
    ipv4_enabled = true
    ipv6_enabled = true
  }
}
```
- [ ] **Step 3: `outputs.tf`**
```hcl
output "ipv4_address" {
description = "Server public IPv4"
value = hcloud_server.this.ipv4_address
}
output "name" {
description = "Server name"
value = hcloud_server.this.name
}
```
- [ ] **Step 4: Format**
Run: `terraform fmt terraform/modules/hetzner_vm/`
Expected: files formatted (or already formatted).
- [ ] **Step 5: Commit**
```bash
git add terraform/modules/hetzner_vm
git commit -m "feat(tf): hetzner_vm module (server + firewall + ssh key + cloud-init)"
```
(append `Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>`)
---
### Task 3: The `offsite` environment
**Files:**
- Create: `terraform/environments/offsite/{providers,variables,main,outputs,backend}.tf`, `terraform.tfvars.example`
- [ ] **Step 1: `providers.tf`** (pin the version from Task 1)
```hcl
# verified: hetznercloud/hcloud ~> 1.48 · debian-13 · cax11@hel1 · <source> · <date>
terraform {
required_version = ">= 1.9"
required_providers {
hcloud = {
source = "hetznercloud/hcloud"
version = "~> 1.48"
}
}
}
provider "hcloud" {
token = var.hcloud_token
}
```
- [ ] **Step 2: `variables.tf`**
```hcl
variable "hcloud_token" {
description = "Hetzner Cloud API token — set via TF_VAR_hcloud_token (from vault.hetzner.token)"
type = string
sensitive = true
}
variable "ansible_ssh_pubkey" {
description = "ubongo's control SSH public key, provisioned for the ansible user"
type = string
}
variable "ssh_admin_cidrs" {
description = "Source CIDRs allowed to SSH askari (ubongo's address/32)"
type = list(string)
}
```
- [ ] **Step 3: `main.tf`**
```hcl
# offsite/main.tf — off-site Hetzner hosts. Terraform owns VM existence (ADR-006,
# generalized to Hetzner). ALWAYS `make tf-plan TF_ENV=offsite` and review before
# `make tf-apply TF_ENV=offsite`.
module "askari" {
source = "../../modules/hetzner_vm"
name = "askari"
server_type = "cax11" # ARM, 2 vCPU / 4 GB
location = "hel1" # Helsinki
image = "debian-13"
ansible_ssh_pubkey = var.ansible_ssh_pubkey
ssh_admin_cidrs = var.ssh_admin_cidrs
labels = {
env = "offsite"
group = "offsite_hosts"
managed-by = "terraform"
}
}
```
- [ ] **Step 4: `outputs.tf`** (the `tf_to_inventory.py` contract — `vms` map)
```hcl
output "vms" {
description = "Hostname → IP and Ansible group — consumed by make tf-inventory-offsite"
value = {
askari = {
ip = module.askari.ipv4_address
group = "offsite_hosts"
}
}
}
```
- [ ] **Step 5: `backend.tf`**
```hcl
# Terraform state: LOCAL, on the control node (like the Proxmox envs; ADR-006).
# askari survives a homelab outage by design, so a lost state is recovered by
# `terraform import` of the running server — not a rebuild. Back the state up with
# the control node (ADR-022).
```
- [ ] **Step 6: `terraform.tfvars.example`**
```hcl
# offsite environment — non-secret values. Copy to terraform.tfvars and fill in.
#
# Secret is exported as an env var (never in this file):
# export TF_VAR_hcloud_token="$(...from vault.hetzner.token...)" # make handles this
#
# State is local (see backend.tf).
ansible_ssh_pubkey = "ssh-ed25519 AAAA... ansible@ubongo"
ssh_admin_cidrs = ["10.20.10.151/32"] # ubongo's LAN address (ADR-021)
```
- [ ] **Step 7: Format + commit**
Run: `terraform fmt terraform/environments/offsite/`
```bash
git add terraform/environments/offsite
git commit -m "feat(tf): offsite environment — askari (CAX11/hel1/debian-13)"
```
(Co-Authored-By trailer)
---
### Task 4: Makefile — token injection, directory inventory, offsite handoff
**Files:**
- Modify: `Makefile`
- [ ] **Step 1: Inject the Hetzner token for `TF_ENV=offsite`**
The `tf-*` targets need `TF_VAR_hcloud_token` for offsite, sourced from the vault. Add a guarded helper variable near the `TF` definition:
```makefile
# For TF_ENV=offsite, export the Hetzner token from the vault (rbw unlocked).
# Reads vault.hetzner.token in-memory; never written to a tfvars file (CLAUDE.md).
ifeq ($(TF_ENV),offsite)
TF_TOKEN_ENV = TF_VAR_hcloud_token="$$($(VENV)/bin/ansible-vault view inventories/production/group_vars/all/vault.yml | $(VENV)/bin/python -c 'import sys,yaml; print(yaml.safe_load(sys.stdin)["vault"]["hetzner"]["token"])')"
else
TF_TOKEN_ENV =
endif
```
Then prefix the `tf-init`/`tf-plan`/`tf-apply`/`tf-output` recipes with `$(TF_TOKEN_ENV)`, e.g.:
```makefile
tf-plan:
	$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/$(TF_ENV) plan
```
(Apply the same prefix to `tf-init`, `tf-apply`, `tf-output`.)
- [ ] **Step 2: Directory inventory**
Change the inventory so multiple TF envs can each generate a file:
```makefile
INVENTORY := -i inventories/production/
```
(Ansible reads every file in the directory as an inventory source and merges them; `group_vars/`/`host_vars/` remain variable dirs. Verify `ansible.cfg` does not also hard-set `inventory=`; if it does, update it to match.)
- [ ] **Step 3: `tf-inventory-offsite` target**
Add (writes the offsite hosts into the production inventory dir, beside the Proxmox-generated `hosts.yml`):
```makefile
tf-inventory-offsite:
	$(TF_TOKEN_ENV) $(TF) -chdir=terraform/environments/offsite output -json \
		| $(PYTHON) scripts/tf_to_inventory.py > inventories/production/offsite.yml
	@echo "Offsite inventory written to inventories/production/offsite.yml"
```
Add `tf-inventory-offsite` to `.PHONY` and a help line.
- [ ] **Step 4: Verify existing playbooks still resolve under the directory inventory**
Run: `make check PLAYBOOK=dns 2>&1 | tail -3`
Expected: still resolves the `control` host and runs (no inventory errors). If `connection:`/group_vars break, fix before committing.
- [ ] **Step 5: Commit**
```bash
git add Makefile
git commit -m "feat(make): offsite TF token injection + directory inventory + tf-inventory-offsite"
```
(Co-Authored-By trailer)
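The inline `python -c` in Step 1 indexes straight into the decrypted vault, so a missing key surfaces as a bare `KeyError` in the middle of a `make` recipe. If that one-liner ever grows into a helper script, the lookup could fail with a clearer message. A minimal sketch follows; the function name `hetzner_token` is hypothetical, and the key path `vault.hetzner.token` is taken from the plan.

```python
def hetzner_token(vault_doc: dict) -> str:
    """Return vault.hetzner.token from the already-parsed vault document."""
    try:
        token = vault_doc["vault"]["hetzner"]["token"]
    except (KeyError, TypeError):
        raise ValueError("vault.hetzner.token not found in decrypted vault") from None
    if not isinstance(token, str) or not token.strip():
        raise ValueError("vault.hetzner.token is empty")
    return token.strip()
```

In the Makefile pipeline the argument would be the result of `yaml.safe_load(sys.stdin)` on the `ansible-vault view` output, so the token still never touches disk.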
---
### Task 5: Lock the offsite inventory handoff (TDD)
**Files:**
- Test: `tests/test_tf_to_inventory.py`
- [ ] **Step 1: Write the failing test**
```python
import json
import pathlib
import subprocess
import sys

_SCRIPT = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "tf_to_inventory.py"


def _run(tf_output: dict) -> str:
    return subprocess.run(
        [sys.executable, str(_SCRIPT)],
        input=json.dumps(tf_output), capture_output=True, text=True, check=True,
    ).stdout


def test_offsite_host_lands_in_offsite_hosts():
    out = _run({"vms": {"value": {"askari": {"ip": "203.0.113.7", "group": "offsite_hosts"}}}})
    assert "offsite_hosts:" in out
    assert "askari:" in out
    assert "ansible_host: 203.0.113.7" in out


def test_unknown_group_rejected():
    proc = subprocess.run(
        [sys.executable, str(_SCRIPT)],
        input=json.dumps({"vms": {"value": {"x": {"ip": "1.2.3.4", "group": "nope"}}}}),
        capture_output=True, text=True,
    )
    assert proc.returncode == 1
    assert "unknown group" in proc.stderr
```
- [ ] **Step 2: Run it**
Run: `.venv/bin/python -m pytest tests/test_tf_to_inventory.py -v`
Expected: PASS — `tf_to_inventory.py` already supports `offsite_hosts` and rejects unknown groups (this test locks that behaviour for the M2 handoff; no code change needed). If it fails, fix `scripts/tf_to_inventory.py` minimally and report.
- [ ] **Step 3: Commit**
```bash
git add tests/test_tf_to_inventory.py
git commit -m "test(tf): lock the offsite_hosts inventory handoff"
```
(Co-Authored-By trailer)
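For orientation, the contract these tests lock can be pictured as a small transformation from the `vms` output to a YAML inventory. The sketch below is an assumption-laden stand-in, not the real `scripts/tf_to_inventory.py`: the `KNOWN_GROUPS` set, the function name, and the exact YAML layout are illustrative, and only the `vms` shape and the `offsite_hosts` group come from the plan.

```python
# Hypothetical allow-list; the real script defines its own set of groups.
KNOWN_GROUPS = {"control", "offsite_hosts"}


def tf_output_to_inventory(tf_output: dict) -> str:
    """Render `terraform output -json` (with a `vms` map) as a YAML inventory."""
    groups: dict[str, dict[str, str]] = {}
    for host, meta in tf_output["vms"]["value"].items():
        group = meta["group"]
        if group not in KNOWN_GROUPS:
            raise ValueError(f"unknown group {group!r} for host {host!r}")
        groups.setdefault(group, {})[host] = meta["ip"]

    lines = ["all:", "  children:"]
    for group in sorted(groups):
        lines.append(f"    {group}:")
        lines.append("      hosts:")
        for host in sorted(groups[group]):
            lines.append(f"        {host}:")
            lines.append(f"          ansible_host: {groups[group][host]}")
    return "\n".join(lines) + "\n"
```

Rejecting unknown groups up front is what keeps a typo in a Terraform label from silently creating a new, unmanaged inventory group.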
---
### Task 6: Init, validate, plan (gated — needs terraform + token)
> Needs `terraform` installed and `rbw` unlocked. Creates **no** resources. If `terraform` is absent, defer Tasks 6–8 to ubongo.
- [ ] **Step 1: Set tfvars**
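`cp terraform/environments/offsite/terraform.tfvars.example terraform/environments/offsite/terraform.tfvars` and set `ansible_ssh_pubkey` to ubongo's real control public key and `ssh_admin_cidrs` to ubongo's address (`10.20.10.151/32`). (`terraform.tfvars` is gitignored.) Note that the Hetzner firewall matches the public source address it sees, so if ubongo reaches the internet through NAT the LAN address will not match; use ubongo's egress IP in that case (see Task 8, Step 1).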
- [ ] **Step 2: Init (tracks the lock file)**
Run: `make tf-init TF_ENV=offsite`
Expected: providers installed; `terraform/environments/offsite/.terraform.lock.hcl` created. `git add` the lock file (tracked per CLAUDE.md).
- [ ] **Step 3: Validate + plan**
Run: `terraform -chdir=terraform/environments/offsite validate` → `Success`.
Run: `make tf-plan TF_ENV=offsite` → review: **1 server + 1 firewall + 1 ssh key to add**. Confirm CAX11/hel1/debian-13 and the SSH-from-ubongo rule.
- [ ] **Step 4: Commit the lock file**
```bash
git add terraform/environments/offsite/.terraform.lock.hcl
git commit -m "chore(tf): pin offsite provider lock (hcloud)"
```
(Co-Authored-By trailer)
---
### Task 7: Apply — create askari (GATED, real billed VPS)
> **Explicit user go required.** Run on ubongo. The plan from Task 6 must be reviewed first (CLAUDE.md: never apply without a shown plan).
- [ ] **Step 1: Apply**
Run: `make tf-apply TF_ENV=offsite`
Expected: `hcloud_ssh_key`, `hcloud_firewall`, `hcloud_server.askari` created; outputs show `askari`'s IPv4.
- [ ] **Step 2: Generate the offsite inventory**
Run: `make tf-inventory-offsite`
Expected: `inventories/production/offsite.yml` written with `askari` under `offsite_hosts`.
- [ ] **Step 3: Verify the inventory merges**
Run: `.venv/bin/ansible-inventory $(INVENTORY) --host askari` (or `--list`), substituting the Makefile's `INVENTORY` value for `$(INVENTORY)` when running outside `make`.
Expected: `askari` present with its `ansible_host`.
- [ ] **Step 4: Commit the generated inventory**
```bash
git add inventories/production/offsite.yml
git commit -m "chore(inventory): askari in offsite_hosts (generated)"
```
(Co-Authored-By trailer)
---
### Task 8: Bootstrap askari (GATED — needs the live host)
> Run on ubongo after Task 7. `rbw` unlocked.
- [ ] **Step 1: Reach it**
Run: `ssh ansible@<askari-ip>` (cloud-init created the `ansible` user with ubongo's key) — expect a shell. If refused, check the firewall `ssh_admin_cidrs` matches ubongo's egress IP.
- [ ] **Step 2: Bootstrap**
Run: `make check PLAYBOOK=bootstrap` (review) then `make deploy PLAYBOOK=bootstrap` — expect the `ansible` user + sudoers confirmed/created on askari (idempotent).
- [ ] **Step 3: No repo commit** — this configures the host, not the repo. (`base` subset = M3.)
---
### Task 9: ADR amendments + STATUS
**Files:**
- Modify: `docs/decisions/006-terraform.md`, `009-provisioning-handoff.md`, `020-firewall.md`, `007-network.md`, `016-mesh-vpn.md`, `STATUS.md`
For each: **Read the relevant section first**, then apply the change.
- [ ] **Step 1: ADR-006 — generalize the provider scope**
In the **Providers** section, the line "`bpg/proxmox` … This is the only provider." → note a second provider:
```
**`hetznercloud/hcloud`**: owns off-site VM existence (`askari`). ADR-006's scope is
**Proxmox + Hetzner** — "Terraform owns VM existence" generalizes across providers; the
`offsite` environment + `hetzner_vm` module live alongside the Proxmox env + module.
```
Also adjust the Context line "creating and destroying VMs on Proxmox" → "on Proxmox and Hetzner".
- [ ] **Step 2: ADR-009 — offsite handoff**
Add a note that `offsite` is a TF environment whose `vms` output feeds `offsite_hosts` via `tf_to_inventory.py` (`make tf-inventory-offsite` → `inventories/production/offsite.yml`), and that the production inventory is a **directory** merging the Proxmox + offsite generated files.
- [ ] **Step 3: ADR-020 — askari's perimeter**
Note that off-cluster `askari` has no OPNsense; its **perimeter** is a TF-managed Hetzner Cloud Firewall (SSH-from-ubongo now; NetBird ports in M4). The `group_vars` catalog stays authoritative for the host nftables layer.
- [ ] **Step 4: ADR-007 / ADR-016 — askari is TF-provisioned**
Replace "provisioned … independently … added manually" wording for askari with "provisioned as Terraform IaC (hcloud), managed independently of the Proxmox cluster (own provider + state)."
- [ ] **Step 5: STATUS.md**
Update askari's row to match how far Tasks 7 and 8 got. If applied: under "Real and working today", list `askari` as **Built + applied** (CAX11/hel1/debian-13, cloud firewall SSH-from-ubongo, bootstrapped, in `offsite_hosts`). If only authored (apply deferred): note that the TF is written and `tf-plan` is clean, with apply pending on ubongo.
- [ ] **Step 6: Lint + commit**
Run: `make lint` (must pass).
```bash
git add docs/decisions/006-terraform.md docs/decisions/009-provisioning-handoff.md \
docs/decisions/020-firewall.md docs/decisions/007-network.md \
docs/decisions/016-mesh-vpn.md STATUS.md
git commit -m "docs(askari): amend ADR-006/009/020/007/016 for TF-provisioned offsite host; STATUS"
```
(Co-Authored-By trailer)
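
The handoff noted in Step 2 above, where the Terraform `vms` output feeds `offsite_hosts`, can be sketched as follows. This is an illustrative model only and not the contents of `scripts/tf_to_inventory.py`: it assumes each `vms` entry carries an `ip` and a `group`, the `ALLOWED_GROUPS` set and the function name are hypothetical, and the address comes from the documentation range.

```python
# Assumption: the real script accepts more groups than this.
ALLOWED_GROUPS = {"offsite_hosts"}


def vms_to_inventory(vms: dict) -> dict:
    """Map {name: {"ip": ..., "group": ...}} to an Ansible-style inventory dict.

    Unknown groups are rejected so that a typo in Terraform cannot silently
    create a new inventory group.
    """
    inventory = {}
    for name, vm in sorted(vms.items()):
        group = vm["group"]
        if group not in ALLOWED_GROUPS:
            raise ValueError(f"unknown inventory group {group!r} for host {name!r}")
        hosts = inventory.setdefault(group, {"hosts": {}})["hosts"]
        hosts[name] = {"ansible_host": vm["ip"]}
    return inventory


if __name__ == "__main__":
    # 203.0.113.10 is a placeholder address, not askari's real IP.
    print(vms_to_inventory({"askari": {"ip": "203.0.113.10", "group": "offsite_hosts"}}))
```

Because the production inventory is a directory, a file generated this way would sit next to the Proxmox-generated one and Ansible would merge the two.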
---
## Self-Review (completed)
- **Spec coverage:** TF owns existence / generalize ADR-006 (Decision 1) → Tasks 2,3,9; CAX11/hel1/debian-13 (Decision 2) → Task 3; TF cloud firewall, SSH-from-ubongo, NetBird ports later (Decision 3) → Task 2 + Task 9 ADR-020; token via `TF_VAR_hcloud_token` from vault (Decision 4) → Task 4; ADR-009 handoff via `tf_to_inventory` (Decision 5) → Tasks 4,5,7; cloud-init `ansible` user + bootstrap → Tasks 2,8; state + DR (import) → Task 3 backend; ADR amendments → Task 9. All covered.
- **Placeholder scan:** none — HCL, make, and test content are concrete. `<askari-ip>`/`<source>`/`<date>` are runtime/verification values, not unspecified logic.
- **Type/name consistency:** module vars (`name`, `server_type`, `location`, `image`, `ansible_ssh_pubkey`, `ssh_admin_cidrs`, `labels`) match between module + env call; the `vms` output shape (`{ip, group}`) matches `tf_to_inventory.py`'s contract; `TF_VAR_hcloud_token` → `var.hcloud_token`; `vault.hetzner.token` matches the stored key.
- **Notes for the implementer:** (a) confirm Ansible merges the directory inventory's two files so `askari` resolves (Task 7 Step 3); (b) verify `hcloud_server` arg names against the pinned provider version (Task 1) and adjust `public_net`/`firewall_ids` if the provider differs; (c) Tasks 7–8 create a billed VPS and are gated on explicit go.


@ -0,0 +1,250 @@
# base SSH hardening + fail2ban (M3) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add SSH-hardening + fail2ban concerns to the `base` role (ADR-002 baseline) and apply them to askari — without locking anything out.
**Architecture:** Two new `base` task files (`ssh.yml`, `fail2ban.yml`), both under the existing `hardening` concern tag, included after `firewall.yml`. Applied to askari **by tag** (`hardening`) so the host firewall (default-deny) is NOT applied pre-mesh — the Hetzner Cloud Firewall remains askari's perimeter until M5. A `LIMIT=`/`TAGS=` passthrough on `make check/deploy` enables the targeted apply.
**Tech Stack:** Ansible (`ansible.builtin`, `ansible.posix.authorized_key` — already vendored), sshd drop-in config, fail2ban.
**Spec:** `docs/superpowers/specs/2026-06-14-base-ssh-fail2ban-m3-design.md`
**Execution context:** Tasks 1–3 are authoring plus Molecule runs (Docker available). **Task 4 applies to live askari** (gated; reachable from ubongo). No new billed resources.
---
### Task 1: `make check/deploy` LIMIT + TAGS passthrough
**Files:** Modify `Makefile` (the `check` and `deploy` recipes).
- [ ] **Step 1:** In the `check:` recipe, change the command line to:
```makefile
$(PLAYBOOK_BIN) $(INVENTORY) $(VAULT_ARGS) $(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS)) --check --diff playbooks/$(PLAYBOOK).yml
```
- [ ] **Step 2:** In the `deploy:` recipe, change the command line to:
```makefile
$(PLAYBOOK_BIN) $(INVENTORY) $(VAULT_ARGS) $(if $(LIMIT),--limit $(LIMIT)) $(if $(TAGS),--tags $(TAGS)) playbooks/$(PLAYBOOK).yml
```
- [ ] **Step 3:** Add help lines noting `[LIMIT=<host>] [TAGS=<tags>]` are optional on check/deploy.
- [ ] **Step 4:** Sanity-check it parses: `make check PLAYBOOK=dns LIMIT=control TAGS=public_dns 2>&1 | tail -2` (should run check-mode scoped to control). Expected: no make/syntax error.
- [ ] **Step 5:** Commit:
```bash
git add Makefile
git commit -m "feat(make): optional LIMIT= and TAGS= passthrough on check/deploy"
```
(append `Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>`)
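
The effect of the two `$(if …)` guards can be modelled outside make. The sketch below is a simplified stand-in for the recipes, not the Makefile itself: it assumes `$(PLAYBOOK_BIN)` resolves to `ansible-playbook`, leaves out `$(INVENTORY)` and `$(VAULT_ARGS)` because their values are not shown here, and `playbook_argv` is a hypothetical helper name.

```python
def playbook_argv(playbook, limit="", tags="", check=False):
    """Build the argument list the check/deploy recipes would run.

    An empty LIMIT or TAGS adds nothing, which mirrors
    `$(if $(LIMIT),--limit $(LIMIT))` expanding to an empty string.
    """
    argv = ["ansible-playbook"]  # assumed value of $(PLAYBOOK_BIN)
    if limit:
        argv += ["--limit", limit]
    if tags:
        argv += ["--tags", tags]
    if check:
        argv += ["--check", "--diff"]
    argv.append(f"playbooks/{playbook}.yml")
    return argv


if __name__ == "__main__":
    # Scoped dry-run, as used later against askari.
    print(" ".join(playbook_argv("site", limit="askari", tags="hardening", check=True)))
    # No LIMIT/TAGS: behaviour is unchanged from before the passthrough.
    print(" ".join(playbook_argv("bootstrap")))
```

The point of the guard is the second case: existing invocations without `LIMIT=` or `TAGS=` keep producing the same command line.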
---
### Task 2: base `hardening` concern — ssh + fail2ban
**Files:** Create `roles/base/tasks/ssh.yml`, `roles/base/tasks/fail2ban.yml`, `roles/base/templates/sshd_hardening.conf.j2`, `roles/base/templates/fail2ban_sshd.local.j2`; modify `roles/base/tasks/main.yml`, `roles/base/defaults/main.yml`, `roles/base/handlers/main.yml`, `inventories/production/group_vars/all/vars.yml`.
- [ ] **Step 1:** Append to `roles/base/defaults/main.yml`:
```yaml
# SSH hardening + fail2ban (ADR-002) — `hardening` concern.
base__ssh_password_authentication: "no"
base__ssh_permit_root_login: "no"
base__fail2ban_maxretry: 5
base__fail2ban_bantime: 1h
base__fail2ban_findtime: 10m
# base__ssh_authorised_keys lives in group_vars/all/vars.yml (per-person control keys).
```
- [ ] **Step 2:** Create `roles/base/templates/sshd_hardening.conf.j2`:
```
# Managed by Ansible (base role, ADR-002). Do not edit on the host.
PasswordAuthentication {{ base__ssh_password_authentication }}
PermitRootLogin {{ base__ssh_permit_root_login }}
PubkeyAuthentication yes
KbdInteractiveAuthentication no
```
- [ ] **Step 3:** Create `roles/base/templates/fail2ban_sshd.local.j2`:
```
# Managed by Ansible (base role, ADR-002).
[sshd]
enabled = true
maxretry = {{ base__fail2ban_maxretry }}
bantime = {{ base__fail2ban_bantime }}
findtime = {{ base__fail2ban_findtime }}
```
- [ ] **Step 4:** Create `roles/base/tasks/ssh.yml`:
```yaml
---
- name: Ensure openssh-server is installed
  ansible.builtin.apt:
    name: openssh-server
    state: present
    update_cache: true

- name: Render hardened sshd drop-in
  ansible.builtin.template:
    src: sshd_hardening.conf.j2
    dest: /etc/ssh/sshd_config.d/10-boma.conf
    owner: root
    group: root
    mode: "0644"
  notify: reload sshd

- name: Validate the full sshd config (drop-in included)
  ansible.builtin.command: sshd -t
  changed_when: false

- name: Authorise control SSH keys for the ansible user
  ansible.posix.authorized_key:
    user: "{{ ansible_user | default('ansible') }}"
    key: "{{ base__ssh_authorised_keys | join('\n') }}"
    exclusive: true
  when: base__ssh_authorised_keys | length > 0
```
- [ ] **Step 5:** Create `roles/base/tasks/fail2ban.yml`:
```yaml
---
- name: Install fail2ban
  ansible.builtin.apt:
    name: fail2ban
    state: present
    update_cache: true

- name: Configure the sshd jail
  ansible.builtin.template:
    src: fail2ban_sshd.local.j2
    dest: /etc/fail2ban/jail.d/sshd.local
    owner: root
    group: root
    mode: "0644"
  notify: restart fail2ban

- name: Enable and start fail2ban
  ansible.builtin.service:
    name: fail2ban
    enabled: true
    state: started
```
- [ ] **Step 6:** Replace `roles/base/handlers/main.yml`:
```yaml
---
- name: Reload sshd
  listen: reload sshd
  ansible.builtin.service:
    name: ssh
    state: reloaded

- name: Restart fail2ban
  listen: restart fail2ban
  ansible.builtin.service:
    name: fail2ban
    state: restarted
```
- [ ] **Step 7:** In `roles/base/tasks/main.yml`, add after the firewall include:
```yaml
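# `apply` makes the included tasks inherit the tag; without it `--tags hardening` runs only the include itself.
- name: SSH hardening
  ansible.builtin.include_tasks:
    file: ssh.yml
    apply: {tags: [hardening]}
  tags: [hardening]
- name: fail2ban intrusion deterrence
  ansible.builtin.include_tasks:
    file: fail2ban.yml
    apply: {tags: [hardening]}
  tags: [hardening]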
```
- [ ] **Step 8:** In `inventories/production/group_vars/all/vars.yml`, set `base__ssh_authorised_keys` (replace the empty `[]`):
```yaml
base__ssh_authorised_keys:
  - "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKSx1TFLJ9H8vCe5ZJSu7MYmAiH0/OC8evloQjGR0Bqw claude@ubongo"
```
- [ ] **Step 9:** `make lint` — expect `0 failure(s)` + `check-tags: OK` (the `hardening` tag is already in `tests/tags.yml`).
- [ ] **Step 10:** Commit:
```bash
git add roles/base inventories/production/group_vars/all/vars.yml
git commit -m "feat(base): ssh hardening + fail2ban (hardening concern, ADR-002)"
```
(Co-Authored-By trailer)
---
### Task 3: Molecule coverage
**Files:** Modify `roles/base/molecule/default/converge.yml`, `roles/base/molecule/default/verify.yml`.
- [ ] **Step 1:** In `converge.yml`, the role already runs with `base__firewall_apply: false`. Leave `base__ssh_authorised_keys` unset (defaults to `[]` → the `authorized_key` task is skipped, no test user needed). No converge change needed unless vars are missing — confirm the play still has `roles: [base]`.
- [ ] **Step 2:** Append assertions to `verify.yml` (after the existing firewall checks):
```yaml
- name: sshd drop-in present and config valid
  ansible.builtin.command: sshd -t
  changed_when: false
  tags: [verify]

- name: PasswordAuthentication is disabled
  ansible.builtin.command: grep -q '^PasswordAuthentication no' /etc/ssh/sshd_config.d/10-boma.conf
  changed_when: false
  tags: [verify]

- name: fail2ban sshd jail configured
  ansible.builtin.command: grep -q '^\[sshd\]' /etc/fail2ban/jail.d/sshd.local
  changed_when: false
  tags: [verify]
```
- [ ] **Step 3:** Run `make test ROLE=base`. Expected: converge installs openssh-server + fail2ban, renders the drop-ins, validates sshd, starts fail2ban; verify passes; idempotence clean. If the Molecule image lacks systemd-for-fail2ban or apt fails offline, capture the error (the image is systemd-enabled per `molecule.yml`).
- [ ] **Step 4:** Commit:
```bash
git add roles/base/molecule
git commit -m "test(base): Molecule coverage for ssh hardening + fail2ban"
```
(Co-Authored-By trailer)
---
### Task 4: Apply to askari (gated — live host)
> Runs against live askari (reachable from ubongo). `rbw` unlocked. Applies ONLY the
> `hardening` concern (`--tags hardening`) so the host firewall is not touched.
- [ ] **Step 1: Dry-run.** `make check PLAYBOOK=site LIMIT=askari TAGS=hardening` — review: openssh-server present, sshd drop-in (`PasswordAuthentication no`, `PermitRootLogin no`), authorized_key for `ansible`, fail2ban installed + sshd jail. Confirm NO firewall tasks appear.
- [ ] **Step 2: Apply.** `make deploy PLAYBOOK=site LIMIT=askari TAGS=hardening` — expect changed for the drop-in, fail2ban install/config; `failed=0`.
- [ ] **Step 3: Verify SSH still works (lock-out guard).** `.venv/bin/ansible offsite_hosts -m ping` → `pong`. And `.venv/bin/ansible offsite_hosts -b -m command -a 'sshd -t'` → rc=0.
- [ ] **Step 4: Verify fail2ban.** `.venv/bin/ansible offsite_hosts -b -m command -a 'fail2ban-client status sshd'` → shows the sshd jail active.
- [ ] **Step 5: Idempotence.** Re-run Step 2 → `changed=0`.
- [ ] **Step 6: No repo commit** (configures the host, not the repo).
---
### Task 5: Docs
**Files:** Modify `STATUS.md`, `docs/ROADMAP.md`.
- [ ] **Step 1:** In `STATUS.md`, update the `roles/base/` row (under "Scaffolded but empty"/partial) to note the `hardening` concern (ssh + fail2ban) is now built, and **applied to askari**; firewall concern still pending application (mesh-gated). If askari's row exists in "Real and working today," append "SSH hardened + fail2ban (M3)".
- [ ] **Step 2:** In `docs/ROADMAP.md`, mark **M3** as done (ssh + fail2ban built + applied to askari; NetBird agent deferred to M4; host firewall + ubongo hardening at M5).
- [ ] **Step 3:** `make lint`; commit:
```bash
git add STATUS.md docs/ROADMAP.md
git commit -m "docs(base): M3 — ssh hardening + fail2ban applied to askari; STATUS + roadmap"
```
(Co-Authored-By trailer)
---
## Self-Review (completed)
- **Spec coverage:** ssh + fail2ban concerns under `hardening` (Decision 1) → Task 2;
apply-by-tag, no firewall (Decision 2) → Task 4 (`TAGS=hardening`); `base__ssh_authorised_keys`
populated (Decision 3) → Task 2 Step 8; LIMIT/TAGS passthrough (Decision 4) → Task 1;
ADR-002 controls (key-only, no root, fail2ban 5/1h) → Task 2; Molecule + live verify
(testing) → Tasks 3, 4. Deferrals (agent/M4, host-fw+ubongo/M5, auditd/Phase 2) honoured.
- **Placeholder scan:** none — all task/template/handler content is concrete.
- **Name consistency:** `base__ssh_*` / `base__fail2ban_*` / `base__ssh_authorised_keys`
used identically across defaults, templates, tasks, and group_vars; handler listen-topics
(`reload sshd`, `restart fail2ban`) match the `notify:` strings.
- **Lock-out guard:** sshd hardening only disables password+root (we use key+sudo); the
`ansible` user's key is preserved (`base__ssh_authorised_keys` has it); `sshd -t`
validates before reload; firewall untouched (`--tags hardening`). Task 4 verifies SSH
post-apply.
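
The "fail2ban 5/1h" control above (with the 10m `findtime` from the defaults) can be pictured with a small model. This is a simplified sketch of the sliding-window idea and not fail2ban's own code: the function name is hypothetical, it tracks a single source address, and the inclusive window boundary is an assumption.

```python
from datetime import datetime, timedelta

MAXRETRY = 5                      # base__fail2ban_maxretry
FINDTIME = timedelta(minutes=10)  # base__fail2ban_findtime
BANTIME = timedelta(hours=1)      # base__fail2ban_bantime


def banned_until(failures, maxretry=MAXRETRY, findtime=FINDTIME, bantime=BANTIME):
    """Return when a ban would expire, or None if it is never triggered.

    `failures` is a time-ordered list of failed-login timestamps for one
    source address. A ban starts once `maxretry` failures fall inside a
    single `findtime` window.
    """
    window = []
    for ts in failures:
        # Drop failures that have aged out of the window.
        window = [t for t in window if ts - t <= findtime]
        window.append(ts)
        if len(window) >= maxretry:
            return ts + bantime
    return None


if __name__ == "__main__":
    start = datetime(2026, 6, 14, 12, 0)
    burst = [start + timedelta(seconds=30 * i) for i in range(5)]
    slow = [start + timedelta(minutes=5 * i) for i in range(5)]
    print("burst banned until:", banned_until(burst))
    print("slow attempts banned until:", banned_until(slow))
```

In this model five failures within two minutes trigger a one-hour ban, while five failures spread over twenty minutes never fill the window.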


@ -0,0 +1,641 @@
# `/kaizen` Command Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Build the `/kaizen` kaizen-loop command — a stdlib scanner that parses `docs/FRICTION.md` *Open signals* plus an interactive command that curates them (add/change/park/remove) into the decisions ledger.
**Architecture:** Mirrors `/review-repo` exactly: a deterministic stdlib Phase-0 scanner (`scripts/friction-scan.py`, unit-tested) feeds a markdown command (`.claude/commands/kaizen.md`) that drives the interactive curation. The same scanner powers a stage-2 nudge surfaced in `/review-repo`.
**Tech Stack:** Python 3 standard library only (matches `scripts/repo-scan.py`); pytest; markdown command docs.
**Spec:** `docs/superpowers/specs/2026-06-14-kaizen-command-design.md`
---
## File structure
- Create: `scripts/friction-scan.py` — stdlib parser of `FRICTION.md` *Open signals*; `--json` (default) and `--nudge` modes. One responsibility: turn the prose signal log into structured data + the nudge line.
- Create: `tests/test_friction_scan.py` — unit tests for the parser (string-based, deterministic via `--today`), matching `tests/test_repo_scan.py`.
- Create: `.claude/commands/kaizen.md` — the interactive curation process.
- Modify: `.claude/commands/review-repo.md` — add the stage-2 nudge line to its report.
- Modify: `STATUS.md` — add a `/kaizen` row.
- Modify: `docs/TODO.md` — mark item 11.1 in progress / built.
All scanner logic lives in functions that take strings/data (not files) so tests need no fixtures on disk; only `load_signals(path, today)` and `main()` touch the filesystem.
---
## Task 1: Scanner scaffold — section extraction + signal splitting
**Files:**
- Create: `scripts/friction-scan.py`
- Test: `tests/test_friction_scan.py`
- [ ] **Step 1: Write the failing test**
```python
# tests/test_friction_scan.py
import importlib.util
import os

# The script name contains a hyphen, so load it by path instead of importing it.
_SPEC = importlib.util.spec_from_file_location(
    "friction_scan",
    os.path.join(os.path.dirname(__file__), "..", "scripts", "friction-scan.py"),
)
fs = importlib.util.module_from_spec(_SPEC)
_SPEC.loader.exec_module(fs)

SAMPLE = """# FRICTION.md
## Open signals
_(append new raw signals here)_
- `[gotcha]` **First thing** (2026-06-01): body line one.
  continuation line two.
- `[friction]` **Second thing** (2026-06-10): only one line.
---
## Kaizen reviews — decisions ledger
- `[gotcha]` **Should not be parsed** (2026-01-01): in the ledger.
"""


def test_extract_open_section_stops_at_next_heading():
    section = fs.extract_open_section(SAMPLE)
    assert "First thing" in section
    assert "Second thing" in section
    assert "Should not be parsed" not in section


def test_split_signals_finds_two_items_and_joins_continuations():
    signals = fs.split_signals(fs.extract_open_section(SAMPLE))
    assert len(signals) == 2
    assert "continuation line two" in signals[0]
    assert signals[1].startswith("`[friction]`")
```
- [ ] **Step 2: Run test to verify it fails**
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -v`
Expected: FAIL — `friction-scan.py` does not exist / `extract_open_section` undefined.
- [ ] **Step 3: Write minimal implementation**
```python
#!/usr/bin/env python3
"""Parse docs/FRICTION.md 'Open signals' into structured data for /kaizen.

Stdlib only. Modes:
  --json (default): emit the open signals as JSON (Phase-0 input for /kaizen)
  --nudge : print a one-line 'loop overdue?' summary

Authoritative design: docs/superpowers/specs/2026-06-14-kaizen-command-design.md
"""
import argparse
import datetime
import json
import os
import re

REPO_ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
FRICTION = os.path.join(REPO_ROOT, "docs", "FRICTION.md")


def extract_open_section(text):
    """Return the body between '## Open signals' and the next '## ' heading."""
    lines = text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if line.strip().lower() == "## open signals":
            start = i + 1
            break
    if start is None:
        return ""
    end = len(lines)
    for j in range(start, len(lines)):
        if lines[j].startswith("## "):
            end = j
            break
    return "\n".join(lines[start:end])


def split_signals(section):
    """Split the Open-signals body into raw per-signal blocks.

    A signal starts with a top-level '- ' bullet; indented or blank lines are
    continuations. Returns a list of multi-line strings with the leading '- '
    stripped from the first line."""
    signals = []
    current = None
    for line in section.splitlines():
        if line.startswith("- "):
            if current is not None:
                signals.append("\n".join(current).strip())
            current = [line[2:]]
        elif current is not None:
            if line.strip() == "" or line.startswith(" "):
                current.append(line.strip())
            else:
                # Any other unindented line (e.g. '---') ends the signal.
                signals.append("\n".join(current).strip())
                current = None
    if current is not None:
        signals.append("\n".join(current).strip())
    return [s for s in signals if s]


if __name__ == "__main__":  # pragma: no cover (filled in Task 4)
    pass
```
- [ ] **Step 4: Run test to verify it passes**
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -v`
Expected: PASS (2 tests).
- [ ] **Step 5: Commit**
```bash
git add scripts/friction-scan.py tests/test_friction_scan.py
git commit -m "feat(kaizen): friction-scan section extraction + signal split"
```
---
## Task 2: Per-signal fields — tag, first_seen, age_days
**Files:**
- Modify: `scripts/friction-scan.py`
- Test: `tests/test_friction_scan.py`
- [ ] **Step 1: Write the failing test**
```python
import datetime

TODAY = datetime.date(2026, 6, 15)


def test_parse_signal_extracts_tag_and_date_and_age():
    raw = fs.split_signals(fs.extract_open_section(SAMPLE))[0]
    sig = fs.parse_signal(raw, TODAY)
    assert sig["tag"] == "gotcha"
    assert sig["first_seen"] == "2026-06-01"
    assert sig["age_days"] == 14
    assert "First thing" in sig["text"]


def test_parse_signal_handles_missing_date():
    sig = fs.parse_signal("`[unused]` **No date here** something", TODAY)
    assert sig["tag"] == "unused"
    assert sig["first_seen"] is None
    assert sig["age_days"] is None
```
- [ ] **Step 2: Run test to verify it fails**
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py::test_parse_signal_extracts_tag_and_date_and_age -v`
Expected: FAIL — `parse_signal` undefined.
- [ ] **Step 3: Write minimal implementation**
Add near the top (after imports):
```python
TAG_RE = re.compile(r"`\[(friction|gotcha|recurring|unused)\]`")
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
```
Add the function (above the `__main__` block):
```python
def parse_signal(raw, today):
    """Turn one raw signal block into a structured dict."""
    tag_m = TAG_RE.search(raw)
    date_m = DATE_RE.search(raw)
    if date_m:
        first_seen = date_m.group(0)
        seen = datetime.date(int(date_m.group(1)), int(date_m.group(2)), int(date_m.group(3)))
        age_days = (today - seen).days
    else:
        first_seen = None
        age_days = None
    return {
        "tag": tag_m.group(1) if tag_m else None,
        "first_seen": first_seen,
        "age_days": age_days,
        "recurrence_count": 1,  # refined in Task 3
        "referenced_paths": [],  # filled in Task 3
        "still_exists": True,  # filled in Task 3
        "text": " ".join(raw.split()),
    }
```
- [ ] **Step 4: Run test to verify it passes**
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -v`
Expected: PASS (4 tests).
- [ ] **Step 5: Commit**
```bash
git add scripts/friction-scan.py tests/test_friction_scan.py
git commit -m "feat(kaizen): parse tag/first_seen/age per signal"
```
---
## Task 3: Recurrence count + referenced paths + still_exists
**Files:**
- Modify: `scripts/friction-scan.py`
- Test: `tests/test_friction_scan.py`
- [ ] **Step 1: Write the failing test**
```python
def test_recurrence_from_ordinal():
    assert fs.parse_recurrence("blah 5th occurrence (06-05/06/06) blah") == 5


def test_recurrence_from_datelist_when_no_ordinal():
    # three slash-separated date fragments → recurrence 3
    assert fs.parse_recurrence("recurred (06-05/06-09/06-10) again") == 3


def test_recurrence_defaults_to_one():
    assert fs.parse_recurrence("a one-off gotcha") == 1


def test_parse_paths_picks_repo_paths_only():
    paths = fs.parse_paths("see `scripts/repo-scan.py` and `latest` and `foo.yml`")
    assert "scripts/repo-scan.py" in paths
    assert "foo.yml" in paths
    assert "latest" not in paths


def test_still_exists_false_for_missing_path():
    sig = fs.parse_signal("`[unused]` **x** (2026-06-01): `scripts/nope-not-real.py`", TODAY)
    assert sig["still_exists"] is False


def test_still_exists_true_for_real_path():
    sig = fs.parse_signal("`[gotcha]` **x** (2026-06-01): `scripts/repo-scan.py`", TODAY)
    assert sig["still_exists"] is True
```
- [ ] **Step 2: Run test to verify it fails**
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -k "recurrence or paths or still_exists" -v`
Expected: FAIL — `parse_recurrence` / `parse_paths` undefined.
- [ ] **Step 3: Write minimal implementation**
Add the regexes near the others:
```python
ORDINAL_RE = re.compile(r"(\d+)(?:st|nd|rd|th)\s+(?:occurrence|reinforcement|time)", re.I)
DATELIST_RE = re.compile(r"\((\d{2}-\d{2}(?:/[\d/-]+)+)\)")
BACKTICK_RE = re.compile(r"`([^`]+)`")
PATH_EXTS = (".py", ".yml", ".yaml", ".md", ".sh", ".tf", ".j2", ".toml", ".cfg", ".hcl")
```
Add the helpers (above `parse_signal`):
```python
def parse_recurrence(text):
    """Best-effort recurrence count from explicit markers; default 1."""
    counts = [1]
    m = ORDINAL_RE.search(text)
    if m:
        counts.append(int(m.group(1)))
    dl = DATELIST_RE.search(text)
    if dl:
        counts.append(dl.group(1).count("/") + 1)
    return max(counts)


def parse_paths(text):
    """Backtick tokens that look like repo paths (contain '/' or a known ext)."""
    out, seen = [], set()
    for m in BACKTICK_RE.finditer(text):
        tok = m.group(1).strip()
        if ("/" in tok or tok.endswith(PATH_EXTS)) and tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out
```
Then update `parse_signal` — replace the three placeholder fields:
```python
    paths = parse_paths(raw)
    still_exists = all(os.path.exists(os.path.join(REPO_ROOT, p)) for p in paths) if paths else True
    return {
        "tag": tag_m.group(1) if tag_m else None,
        "first_seen": first_seen,
        "age_days": age_days,
        "recurrence_count": parse_recurrence(raw),
        "referenced_paths": paths,
        "still_exists": still_exists,
        "text": " ".join(raw.split()),
    }
```
- [ ] **Step 4: Run test to verify it passes**
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -v`
Expected: PASS (10 tests).
- [ ] **Step 5: Commit**
```bash
git add scripts/friction-scan.py tests/test_friction_scan.py
git commit -m "feat(kaizen): recurrence count + referenced-path existence"
```
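
As a worked example of the recurrence heuristic in this task, the sketch below repeats the two regexes so that it runs on its own and feeds them a few made-up signal fragments. The sample strings are illustrative and are not taken from `docs/FRICTION.md`.

```python
import re

# Same patterns as in Task 3, copied here so the example is self-contained.
ORDINAL_RE = re.compile(r"(\d+)(?:st|nd|rd|th)\s+(?:occurrence|reinforcement|time)", re.I)
DATELIST_RE = re.compile(r"\((\d{2}-\d{2}(?:/[\d/-]+)+)\)")


def recurrence(text):
    """Largest count implied by an ordinal marker or a slash-separated date list."""
    counts = [1]
    m = ORDINAL_RE.search(text)
    if m:
        counts.append(int(m.group(1)))
    dl = DATELIST_RE.search(text)
    if dl:
        # N slash-separated fragments contain N - 1 slashes.
        counts.append(dl.group(1).count("/") + 1)
    return max(counts)


# Hypothetical fragments and the count each one yields.
EXAMPLES = {
    "3rd time this bit us": 3,                     # ordinal marker only
    "seen again (06-05/06-09)": 2,                 # two dates in the list
    "2nd occurrence (06-01/06-04/06-08/06-12)": 4, # date list outvotes ordinal
    "noted (2026-06-01): one-off": 1,              # a full date is not a date list
}

if __name__ == "__main__":
    for text, expected in EXAMPLES.items():
        print(f"{recurrence(text)} (expected {expected}): {text}")
```

Taking the maximum means a stale ordinal cannot understate a longer date list, at the cost of trusting whichever marker is larger.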
---
## Task 4: CLI — `load_signals`, `--json`, `--nudge`
**Files:**
- Modify: `scripts/friction-scan.py`
- Test: `tests/test_friction_scan.py`
- [ ] **Step 1: Write the failing test**
```python
def test_nudge_line_overdue_on_recurrence():
    sigs = [{"age_days": 2, "recurrence_count": 5}]
    line = fs.nudge_line(sigs)
    assert "OVERDUE" in line
    assert "max recurrence 5x" in line


def test_nudge_line_ok_when_quiet():
    sigs = [{"age_days": 3, "recurrence_count": 1}, {"age_days": 1, "recurrence_count": 1}]
    line = fs.nudge_line(sigs)
    assert "ok" in line
    assert "OVERDUE" not in line


def test_nudge_line_overdue_on_count():
    sigs = [{"age_days": 1, "recurrence_count": 1} for _ in range(8)]
    assert "OVERDUE" in fs.nudge_line(sigs)


def test_load_signals_reads_real_friction_file():
    path = os.path.join(os.path.dirname(__file__), "..", "docs", "FRICTION.md")
    sigs = fs.load_signals(path, TODAY)
    assert len(sigs) >= 1
    assert all(s["tag"] in {"friction", "gotcha", "recurring", "unused"} for s in sigs)
```
- [ ] **Step 2: Run test to verify it fails**
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -k "nudge or load_signals" -v`
Expected: FAIL — `nudge_line` / `load_signals` undefined.
- [ ] **Step 3: Write minimal implementation**
Add thresholds near the top (after `FRICTION = ...`):
```python
# Nudge thresholds (tunable; the /kaizen self-eval phase revisits these).
NUDGE_MIN_OPEN = 8
NUDGE_MAX_AGE_DAYS = 21
NUDGE_MIN_RECURRENCE = 3
```
Add the functions and replace the `__main__` block:
```python
def load_signals(path, today):
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    return [parse_signal(s, today) for s in split_signals(extract_open_section(text))]


def nudge_line(signals):
    n = len(signals)
    ages = [s["age_days"] for s in signals if s.get("age_days") is not None]
    oldest = max(ages) if ages else 0
    max_rec = max((s["recurrence_count"] for s in signals), default=0)
    overdue = n >= NUDGE_MIN_OPEN or oldest >= NUDGE_MAX_AGE_DAYS or max_rec >= NUDGE_MIN_RECURRENCE
    status = "OVERDUE — run /kaizen" if overdue else "ok"
    return f"kaizen: {n} open signals, oldest {oldest}d, max recurrence {max_rec}x — {status}"


def main():
    parser = argparse.ArgumentParser(description="Parse FRICTION.md Open signals for /kaizen.")
    parser.add_argument("--nudge", action="store_true", help="print a one-line overdue summary")
    parser.add_argument("--today", help="override today's date (YYYY-MM-DD) for testing")
    parser.add_argument("--file", default=FRICTION, help="path to FRICTION.md")
    args = parser.parse_args()
    if args.today:
        y, m, d = args.today.split("-")
        today = datetime.date(int(y), int(m), int(d))
    else:
        today = datetime.date.today()
    signals = load_signals(args.file, today)
    if args.nudge:
        print(nudge_line(signals))
    else:
        print(json.dumps(signals, indent=2))


if __name__ == "__main__":
    main()
```
- [ ] **Step 4: Run tests + smoke-test the CLI**
Run: `.venv/bin/python -m pytest tests/test_friction_scan.py -v`
Expected: PASS (14 tests).
Run: `python3 scripts/friction-scan.py --nudge`
Expected: one line like `kaizen: 13 open signals, oldest 14d, max recurrence 5x — OVERDUE — run /kaizen`.
Run: `python3 scripts/friction-scan.py | head -20`
Expected: a JSON array of signal objects.
- [ ] **Step 5: Commit**
```bash
git add scripts/friction-scan.py tests/test_friction_scan.py
git commit -m "feat(kaizen): friction-scan CLI (--json default, --nudge)"
```
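
The three nudge thresholds from this task combine with a plain `or`, so any single one can tip the loop into OVERDUE. The sketch below isolates that rule to make the boundaries visible; `is_overdue` is a hypothetical helper for illustration and the constants are copied from Step 3.

```python
# Thresholds copied from Task 4 Step 3.
NUDGE_MIN_OPEN = 8
NUDGE_MAX_AGE_DAYS = 21
NUDGE_MIN_RECURRENCE = 3


def is_overdue(n_open, oldest_days, max_recurrence):
    """True when any one threshold is met; the comparisons are inclusive."""
    return (
        n_open >= NUDGE_MIN_OPEN
        or oldest_days >= NUDGE_MAX_AGE_DAYS
        or max_recurrence >= NUDGE_MIN_RECURRENCE
    )


# (open signals, oldest age in days, max recurrence): illustrative values.
CASES = [
    (2, 3, 1),    # quiet log
    (8, 1, 1),    # backlog alone reaches the threshold
    (1, 21, 1),   # one old signal is enough
    (1, 2, 3),    # one recurring signal is enough
    (7, 20, 2),   # just under every threshold
]

if __name__ == "__main__":
    for case in CASES:
        print(case, "OVERDUE" if is_overdue(*case) else "ok")
```

Because the rule is a disjunction, tuning one threshold never masks another; the last case shows how close a log can get to all three limits while still reporting ok.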
---
## Task 5: The `/kaizen` command document
**Files:**
- Create: `.claude/commands/kaizen.md`
- [ ] **Step 1: Create the command file**
Write `.claude/commands/kaizen.md` with exactly this content:
````markdown
# Kaizen — curate the friction log into improvements
Consume the **Open signals** in `docs/FRICTION.md`: decide a verdict for each, migrate
durable knowledge into the right docs, and archive consumed signals into the decisions
ledger. **Curate-only** — do not hunt for new signals; capture stays manual. This is an
interactive, judgment-dense pass: propose, the operator decides, you apply on approval.
Design: `docs/superpowers/specs/2026-06-14-kaizen-command-design.md`.
## Phase 0 — scan
Run `python3 scripts/friction-scan.py > /tmp/kaizen.json`. It returns each Open signal as
`{tag, first_seen, age_days, recurrence_count, referenced_paths, still_exists, text}`.
Treat `still_exists: false` as a hint the signal may already be resolved.
## Phase 1 — triage
Order signals by `recurrence_count` desc, then `age_days` desc, then tag. **Group signals
that share a root cause** and curate them together. Present the agenda before editing
anything: total open, how many recurring (≥3), how many look already-resolved.
## Phase 2 — per-signal curation (interactive)
For each signal/group, present: a one-line restatement, the evidence (age, recurrence,
still-real), and a proposed **verdict**. Verdicts:
- **SYSTEMATIZE** — migrate the durable lesson into its right home (a runbook, an ADR,
`CLAUDE.md`, a new `scripts/repo-scan.py` check, or a hook).
- **CHANGE** — adjust an existing tool/convention/config rather than document it.
- **PARK** — *out-of-phase but not obsolete*. Remove from the active tree, but write a
ledger row recording **where it now lives (git SHA/branch/doc) and a resurrection
trigger**. The default for "not touched lately but not wrong."
- **REMOVE** — *obsolete*: superseded, wrong, never worked, duplicated. Ledger row states
why.
- **ALREADY-BUILT** — the systematization already exists / the fix landed; archive.
- **ACCEPTED** — conscious no-op (revisit-if-recurs); archive.
- **KEEP-OPEN** — still accruing, not ripe; leave it in *Open signals* (no ledger row).
Rules:
- **Knowledge is never removed** — SYSTEMATIZE/migrate it; only *active surface* (scripts,
checks, conventions, plugins) is parked/removed.
- Every reductive verdict must classify *why unused*: **obsolete → REMOVE**,
**out-of-phase → PARK**.
- The operator approves / modifies / rejects each verdict. On approval: do the mechanical
edit (migrate text into the target doc; **move the signal from *Open signals* into the
ledger table**; delete the parked/removed file) and show the diff.
- PARK and REMOVE both delete from the active tree — the difference is the ledger row.
Git history + the ledger row are the park mechanism; never create a `parked/` directory.
## Phase 3 — close-out
- Add a new dated block under `## Kaizen reviews — decisions ledger` (newest first), same
shape as the existing block: a table with columns **Signal (first seen) | Verdict |
Resolution / where it lives now**.
- **Bias-to-remove discipline check:** if every verdict this pass was SYSTEMATIZE/CHANGE
(only accreting), say so explicitly.
- **Self-eval (light):** is `/kaizen` being run often enough (oldest consumed age)? Should
the nudge thresholds in `scripts/friction-scan.py` change? Note it.
- Run `make lint` if any code/docs changed; revert anything that breaks it.
- Commit per `CLAUDE.md` git conventions (one logical unit — straight to `main` if
small/safe, a branch if sweeping; show the diff first for a branch).
- Print a one-line summary: `consumed X · parked Y · removed Z · kept-open W · migrated → <docs>`.
## Headless / cron (future)
Deferred until the notify + cron stack exists (`docs/TODO.md` 11.3). When run
non-interactively, **report only**: print the proposed verdicts and the nudge, do not edit
or commit.
````
- [ ] **Step 2: Verify it parses against the real log**
Run: `python3 scripts/friction-scan.py --today 2026-06-15 | python3 -c "import sys,json; print(len(json.load(sys.stdin)), 'signals')"`
Expected: prints a non-zero signal count with no traceback.
- [ ] **Step 3: Lint**
Run: `make lint`
Expected: passes (markdown isn't linted by yamllint/ansible-lint, but this confirms nothing else broke).
- [ ] **Step 4: Commit**
```bash
git add .claude/commands/kaizen.md
git commit -m "feat(kaizen): /kaizen command — interactive friction curation"
```
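The Phase 1 triage order in the command doc above can be sketched in a few lines of Python. This is an illustration only, not the code in `scripts/friction-scan.py`: the `triage` helper and the sample data are hypothetical, and it assumes `still_exists` is a plain boolean on each signal.

```python
def triage(signals, recurring_threshold=3):
    """Order open signals for the agenda and compute the headline counts.

    Sort key follows the command doc: recurrence_count desc, age_days desc, tag.
    """
    ordered = sorted(
        signals,
        key=lambda s: (-s["recurrence_count"], -s["age_days"], s["tag"]),
    )
    agenda = {
        "total_open": len(ordered),
        "recurring": sum(
            1 for s in ordered if s["recurrence_count"] >= recurring_threshold
        ),
        # still_exists: false is only a hint that the signal may be resolved
        "maybe_resolved": sum(1 for s in ordered if not s["still_exists"]),
    }
    return ordered, agenda


# Hypothetical sample signals, trimmed to the keys the sort needs.
sample = [
    {"tag": "lint", "age_days": 10, "recurrence_count": 1, "still_exists": True},
    {"tag": "menu", "age_days": 4, "recurrence_count": 5, "still_exists": True},
    {"tag": "dns", "age_days": 30, "recurrence_count": 1, "still_exists": False},
]
ordered, agenda = triage(sample)
print([s["tag"] for s in ordered])
print(agenda)
```

Grouping signals by shared root cause stays a judgment call for the operator, so the sketch leaves it out.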
---
## Task 6: Stage-2 nudge in `/review-repo` + STATUS/TODO
**Files:**
- Modify: `.claude/commands/review-repo.md`
- Modify: `STATUS.md`
- Modify: `docs/TODO.md`
- [ ] **Step 1: Add the nudge to the review-repo command**
In `.claude/commands/review-repo.md`, find the "Phase 0 — deterministic pre-scan" section
(it runs `scripts/repo-scan.py`). Immediately after that paragraph, add:
```markdown
Also run `python3 scripts/friction-scan.py --nudge` and include its one-line output in the
report's summary — it flags when the kaizen loop (`/kaizen`) is overdue (recurring signals,
backlog size, or age). This is a reminder only; do not act on `FRICTION.md` from here.
```
- [ ] **Step 2: Add a STATUS row**
In `STATUS.md`, under "Real and working today", add a row:
```markdown
| `/kaizen` | Curate `docs/FRICTION.md` Open signals → decisions ledger (`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`). On-demand; `--nudge` surfaces in `/review-repo`. Headless/cron deferred (TODO 11.3). |
```
- [ ] **Step 3: Update TODO 11**
In `docs/TODO.md` item 11, mark sub-item 1 built:
Strike through the sub-item's existing text (``1. ~~Build `/retro`: ...~~``) and append:
``DONE — built as `/kaizen` (scope narrowed to curate-only per the 2026-06-14 spec;
`/retro` name dropped). `scripts/friction-scan.py` + `.claude/commands/kaizen.md`.``
- [ ] **Step 4: Lint**
Run: `make lint`
Expected: passes.
- [ ] **Step 5: Commit**
```bash
git add .claude/commands/review-repo.md STATUS.md docs/TODO.md
git commit -m "feat(kaizen): nudge in /review-repo; STATUS + TODO"
```
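For orientation, the one-line nudge could be shaped roughly as below. This is a sketch under stated assumptions, not the real `nudge_line`: only the recurring cutoff (3) comes from the command doc, while `BACKLOG_MAX`, `AGE_MAX_DAYS` and the output wording are hypothetical placeholders for whatever `scripts/friction-scan.py` defines.

```python
RECURRING_MIN = 3    # from the command doc: "recurring (>=3)"
BACKLOG_MAX = 10     # hypothetical: more open signals than this is a large backlog
AGE_MAX_DAYS = 30    # hypothetical: an older oldest-signal means the loop is overdue


def nudge_line(signals):
    """Return a one-line reminder; reads only age_days and recurrence_count."""
    reasons = []
    recurring = sum(1 for s in signals if s["recurrence_count"] >= RECURRING_MIN)
    if recurring:
        reasons.append(f"{recurring} recurring")
    if len(signals) > BACKLOG_MAX:
        reasons.append(f"backlog {len(signals)}")
    oldest = max((s["age_days"] for s in signals), default=0)
    if oldest > AGE_MAX_DAYS:
        reasons.append(f"oldest {oldest}d")
    if not reasons:
        return "kaizen: ok"
    return "kaizen: /kaizen overdue (" + ", ".join(reasons) + ")"


print(nudge_line([{"age_days": 40, "recurrence_count": 5}]))
```

Keeping the function pure (a list of dicts in, a string out) lets `/review-repo` embed the result without touching `FRICTION.md`.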
---
## Task 7: Dogfood — first real `/kaizen` run
This task is **not** automated; it is the first real use, done interactively with the operator.
- [ ] **Step 1:** Run `/kaizen` against the current Open signals (there are several,
including the 3 added 2026-06-14 and the 5× execution-mode-menu signal).
- [ ] **Step 2:** Work the interactive curation (Phase 2) with the operator, applying
verdicts on approval.
- [ ] **Step 3:** Confirm the close-out: ledger updated, `make lint` green, summary printed.
This both processes the backlog and validates the command end-to-end.
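When checking the close-out by hand, the summary line and the bias-to-remove check can be derived from the list of approved verdicts. The helper below is a hypothetical sketch, not part of the command: it assumes "consumed" means every verdict except KEEP-OPEN, and the fallback text `none` for an empty migration list is an invented placeholder.

```python
from collections import Counter


def close_out(verdicts, migrated_docs):
    """Build the Phase 3 summary line and flag an accretion-only pass."""
    counts = Counter(verdicts)
    kept_open = counts["KEEP-OPEN"]
    consumed = len(verdicts) - kept_open  # assumption: KEEP-OPEN is not consumed
    summary = (
        f"consumed {consumed} · parked {counts['PARK']} · removed {counts['REMOVE']}"
        f" · kept-open {kept_open} · migrated → {', '.join(migrated_docs) or 'none'}"
    )
    # Bias-to-remove check: true when every verdict was SYSTEMATIZE or CHANGE.
    only_accreting = bool(verdicts) and all(
        v in {"SYSTEMATIZE", "CHANGE"} for v in verdicts
    )
    return summary, only_accreting


summary, only_accreting = close_out(
    ["SYSTEMATIZE", "PARK", "KEEP-OPEN", "REMOVE"], ["CLAUDE.md"]
)
print(summary)
print(only_accreting)
```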
---
## Self-review notes (author)
- **Spec coverage:** scope-A curate-only → Task 5 Phases 0–2; verdict model incl. PARK →
Task 5 Phase 2 + ledger; single source FRICTION.md → Task 4 `load_signals`; interactive
apply (B) → Task 5; ledger format → Task 5 Phase 3; scanner schema → Tasks 2–4; nudge +
thresholds → Task 4 + Task 6; out-of-scope items → not built (correct); `/review-repo`
relationship → Task 6 nudge. All covered.
- **No placeholders:** every code step shows complete code; the command doc is written in
full.
- **Type consistency:** the signal dict keys (`tag, first_seen, age_days,
recurrence_count, referenced_paths, still_exists, text`) are identical across Tasks 2–4
and the command doc; `nudge_line` reads `age_days`/`recurrence_count` only.


@ -0,0 +1,146 @@
# M4a — Docker + Caddy reverse proxy (platform) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use checkbox (`- [ ]`) syntax.
**Goal:** Stand up the platform NetBird needs — Docker on askari + boma's standard Caddy reverse proxy with Gandi DNS-01 wildcard certs — proven end-to-end by serving a test route over TLS.
**Architecture:** `docker_host` installs Docker engine + compose (pinned). A custom Caddy image (`xcaddy` + `caddy-dns/gandi`) gives DNS-01 via `vault.gandi.pat`. The `reverse_proxy` role renders a Caddyfile from `reverse_proxy__routes` data + an `.env`. The M2 Hetzner firewall opens 80/443; `public_dns` publishes `*.askari.wingu.me`. M4b adds NetBird as a route.
**Tech Stack:** Docker CE, Caddy (custom xcaddy build), ACME DNS-01 (Gandi), Ansible, Terraform (hcloud firewall).
**Spec:** `docs/superpowers/specs/2026-06-14-netbird-coordinator-m4-design.md`
**Execution context:** Tasks are authored here; **Task 7 applies live to askari + issues a real cert** (gated). The custom image builds locally with Docker (available).
---
### Task 1: ADR — boma's reverse proxy is Caddy
- [ ] **Step 1:** Create `docs/decisions/024-reverse-proxy.md` following ADR-023's
structure (Status: Accepted; Context; Decision; Consequences; Related). Decision:
**Caddy** is boma's reverse proxy (rationale from the M4 spec Decision 1: Ansible-rendered
config fits Caddy not Traefik's discovery; automatic HTTPS + Gandi DNS-01; simpler at
this scale; `forward_auth` to Authentik preserved). Note it amends the soft Traefik
assumption in the roadmap/ADR-017 prose (no prior ADR pinned Traefik).
- [ ] **Step 2:** Add the ADR-024 row to `CLAUDE.md`'s Further-reading table and update
the roadmap Phase-2 "auth + reverse proxy" line (Authentik + **Caddy**, not Traefik).
- [ ] **Step 3:** `make lint`; commit `docs(adr): ADR-024 — Caddy is boma's reverse proxy`.
---
### Task 2: `docker_host` — install Docker engine
**Files:** `roles/docker_host/{defaults,tasks}/main.yml`, `roles/docker_host/README.md`.
- [ ] **Step 1:** `defaults/main.yml` — `docker_host__compose_version`-style pins (use the
Docker apt repo; pin via apt or accept repo latest with a comment). Variables:
`docker_host__packages: [docker-ce, docker-ce-cli, containerd.io, docker-compose-plugin]`.
- [ ] **Step 2:** `tasks/main.yml` — add the Docker apt repo + GPG key (`ansible.builtin.apt_key`/`deb822_repository`),
`apt` install `docker_host__packages`, enable+start `docker`. (Tag: role-name; concern `packages`.)
- [ ] **Step 3:** Fill `README.md` (purpose, vars). `make lint`.
- [ ] **Step 4:** Molecule: converge installs Docker; verify `docker --version` + service active. (`make test ROLE=docker_host`; build the image if needed.)
- [ ] **Step 5:** Commit `feat(docker_host): install Docker engine + compose plugin`.
---
### Task 3: Custom Caddy image (xcaddy + caddy-dns/gandi)
**Files:** `.docker/caddy-gandi/Dockerfile`, `Makefile` (a `caddy-image` target).
- [ ] **Step 1:** `.docker/caddy-gandi/Dockerfile` (verify the latest stable Caddy + plugin tags per ADR-014):
```dockerfile
FROM caddy:2-builder AS build
RUN xcaddy build --with github.com/caddy-dns/gandi
FROM caddy:2
COPY --from=build /usr/bin/caddy /usr/bin/caddy
```
- [ ] **Step 2:** `Makefile` — add `caddy-image` (build, tagged for the Forgejo registry like the Molecule image) + `caddy-image-push`. Add to `.PHONY` + help.
- [ ] **Step 3:** Build it: `make caddy-image`; verify `docker run --rm <img> caddy list-modules | grep dns.providers.gandi`. Expected: the module is listed.
- [ ] **Step 4:** Commit `feat(docker): custom Caddy image with the Gandi DNS-01 plugin`.
---
### Task 4: `reverse_proxy` role (Caddy)
**Files:** create `roles/reverse_proxy/{defaults,tasks}/main.yml`, `templates/{docker-compose.yml.j2,Caddyfile.j2,env.j2}`, `README.md`; `inventories/production/group_vars/all/reverse_proxy.yml`.
- [ ] **Step 1:** `group_vars/all/reverse_proxy.yml` — route data:
```yaml
reverse_proxy__image: "<forgejo-registry>/sjat/caddy-gandi:latest"
reverse_proxy__base_dir: /opt/services/reverse_proxy
reverse_proxy__acme_domain: askari.wingu.me # wildcard *.askari.wingu.me
reverse_proxy__routes: [] # M4b appends: {host: netbird.askari.wingu.me, upstream: "netbird-dashboard:80"}
```
- [ ] **Step 2:** `templates/Caddyfile.j2` — global TLS via Gandi DNS-01 + a per-route block:
```
{
    email admin@wingu.me
}

*.{{ reverse_proxy__acme_domain }} {
    tls {
        dns gandi {env.GANDI_BEARER_TOKEN}
    }
{% for r in reverse_proxy__routes %}
    @{{ r.host | replace('.', '_') }} host {{ r.host }}
    handle @{{ r.host | replace('.', '_') }} {
        reverse_proxy {{ r.upstream }}
    }
{% endfor %}
    handle {
        respond "boma reverse proxy" 200
    }
}
```
- [ ] **Step 3:** `templates/env.j2` — `GANDI_BEARER_TOKEN={{ vault.gandi.pat }}`.
- [ ] **Step 4:** `templates/docker-compose.yml.j2` — the Caddy service (image `reverse_proxy__image`, ports 80:80 + 443:443, env_file, volumes for the Caddyfile + cert data, restart unless-stopped).
- [ ] **Step 5:** `tasks/main.yml` — ADR-004 deploy mechanics: ensure `base_dir`, render compose+Caddyfile+env, `community.docker.docker_compose_v2` up. (Adds `community.docker` to `requirements.yml` with the on-demand comment.)
- [ ] **Step 6:** `README.md`; `make lint`.
- [ ] **Step 7:** Molecule (render-only): converge renders the files (compose `apply:false`-style or skip the up in container); verify `caddy validate --config Caddyfile` passes. Commit `feat(reverse_proxy): Caddy role (Gandi DNS-01, route catalog)`.
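To make the route catalog concrete, here is a small Python sketch of what the `Caddyfile.j2` loop emits for one route, plus a check that a host sits exactly one label under the wildcard domain (a `*.askari.wingu.me` certificate does not cover deeper names). Both helpers are hypothetical illustrations using plain string handling; the role itself renders through Jinja2.

```python
def render_routes(routes):
    """Mirror the per-route block of the template: matcher + handle."""
    lines = []
    for r in routes:
        # Same transformation as the template's replace('.', '_') filter
        matcher = "@" + r["host"].replace(".", "_")
        lines.append(f"    {matcher} host {r['host']}")
        lines.append(f"    handle {matcher} {{")
        lines.append(f"        reverse_proxy {r['upstream']}")
        lines.append("    }")
    return "\n".join(lines)


def covered_by_wildcard(host, domain):
    """True if host is exactly one label below domain (what *.domain covers)."""
    suffix = "." + domain
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    return bool(label) and "." not in label


routes = [{"host": "netbird.askari.wingu.me", "upstream": "netbird-dashboard:80"}]
print(render_routes(routes))
print(covered_by_wildcard("netbird.askari.wingu.me", "askari.wingu.me"))
```

A check like `covered_by_wildcard` could run as a data test over `reverse_proxy__routes`, in the same spirit as the pytest used for the DNS record data.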
---
### Task 5: Open the firewall (TF) + DNS
- [ ] **Step 1:** In `terraform/modules/hetzner_vm/main.tf`, add Caddy ports to the firewall (variable-driven so other hosts differ): inbound **80/tcp** + **443/tcp** from `0.0.0.0/0` + **3478/udp** (NetBird, M4b uses it) — gate behind a `var.public_web` bool defaulting false; set true for askari in `environments/offsite/main.tf`. `terraform fmt`.
- [ ] **Step 2:** `make tf-plan TF_ENV=offsite` (review: firewall adds 80/443[/3478]) → **gated** `make tf-apply TF_ENV=offsite`.
- [ ] **Step 3:** Add `*.askari.wingu.me` A → askari's IP to `public_dns__records` (`group_vars/all/public_dns.yml`); `make deploy PLAYBOOK=dns`; `dig +short test.askari.wingu.me` → askari IP.
- [ ] **Step 4:** Commit the TF + DNS changes.
---
### Task 6: Playbook wiring
- [ ] **Step 1:** Create `playbooks/offsite.yml` targeting `offsite_hosts`: roles `docker_host` then `reverse_proxy` (each with its role-name tag). `make lint` (check-tags verifies the role-name tags).
- [ ] **Step 2:** Commit `feat(offsite): playbook applying docker_host + reverse_proxy to askari`.
---
### Task 7: Apply to askari + prove TLS (gated, live)
> Live on askari. Issues a **real cert** via DNS-01. `rbw` unlocked.
- [ ] **Step 1:** `make check PLAYBOOK=offsite LIMIT=askari` — review.
- [ ] **Step 2:** `make deploy PLAYBOOK=offsite LIMIT=askari` — Docker installs, Caddy comes up.
- [ ] **Step 3:** Prove it (from ubongo): `curl -sSI https://test.askari.wingu.me` → `HTTP/2 200` with a **valid Let's Encrypt cert** (the wildcard `*.askari.wingu.me` issued via Gandi DNS-01). `curl -s https://test.askari.wingu.me` → `boma reverse proxy`.
- [ ] **Step 4:** `.venv/bin/ansible offsite_hosts -b -m command -a 'docker compose -f /opt/services/reverse_proxy/docker-compose.yml ps'` → Caddy healthy.
- [ ] **Step 5:** No repo commit (host state).
---
### Task 8: Docs
- [ ] **Step 1:** STATUS.md — Docker on askari + the `reverse_proxy` (Caddy) role built + applied; `*.askari.wingu.me` cert live. ROADMAP M4 — note M4a done, M4b (NetBird) next.
- [ ] **Step 2:** `make lint`; commit.
---
## Self-Review (completed)
- **Spec coverage:** Caddy-as-standard ADR (Decision 1) → Task 1; docker_host (Decision 4) →
Task 2; custom Caddy image + DNS-01 (Decision 2) → Task 3; reverse_proxy role + route
catalog (Decision 4) → Task 4; firewall 80/443/3478 (Decision 5) → Task 5; DNS (Decision 6)
→ Task 5; live cert proof (testing) → Task 7. NetBird itself (Decisions 3,7,8) → **M4b**, correct.
- **Placeholder scan:** `<forgejo-registry>` is the known registry host (`forgejo.nyumbani.baobab.band/...`) — fill from the Molecule image var; not a logic gap. Version pins (Caddy, Docker, plugin) are flagged ADR-014 verifications, done in their tasks.
- **Name consistency:** `reverse_proxy__*`, `vault.gandi.pat` → `GANDI_BEARER_TOKEN`, `*.askari.wingu.me` used consistently across role, templates, firewall, and DNS.
- **Risk:** the custom image + DNS-01 is the novel bit — Task 3 verifies the module loads and Task 7 proves a real cert issues before M4b depends on it.


@ -0,0 +1,91 @@
# M4b — NetBird coordinator (service role) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: superpowers:subagent-driven-development (recommended) or superpowers:executing-plans. Steps use `- [ ]` checkboxes.
**Goal:** Deploy the self-hosted NetBird control plane on askari as boma's first real service role (`netbird_coordinator`), fronted by the M4a Caddy, reachable at `https://netbird.askari.wingu.me` with the embedded Dex login.
**Architecture:** NetBird's own `configure.sh` generates the canonical compose + config for a pinned version; boma **captures that reference once and translates it into role templates** (ADR-004/013 — don't run their imperative script in production, render from templates). Runs in **external-reverse-proxy mode** (no bundled Traefik); Caddy adds a `netbird.askari.wingu.me` route. Secrets (datastore encryption key, TURN password, Dex secrets) are generated into vault; the setup key is stubbed `CHANGEME` for M5.
**Tech Stack:** NetBird (combined `netbird-server` container if stable for the pinned version, else the multi-container set), embedded Dex IdP, Coturn, Docker Compose, Caddy (M4a), Ansible.
**Spec:** `docs/superpowers/specs/2026-06-14-netbird-coordinator-m4-design.md` · **Prereq:** M4a (Docker + Caddy) ✓ on askari.
**Execution context:** Task 1 runs `configure.sh` in a scratch dir (capture only). Tasks 2–6 author. **Task 7 deploys live to askari** (gated). NetBird self-hosting is finicky — expect live debugging.
---
### Task 1: Capture NetBird's reference setup (pin the version)
- [ ] **Step 1:** Pick + pin the NetBird version (ADR-014 — check the latest stable release). Record it.
- [ ] **Step 2:** In a scratch dir (on ubongo, throwaway), fetch NetBird's `getting-started`/`configure.sh` for that version and run it with answers for: domain `netbird.askari.wingu.me`, **external reverse proxy** (disable bundled Traefik/Caddy), **embedded Dex** (no external SSO), Let's Encrypt off (Caddy terminates TLS).
- [ ] **Step 3:** Capture the generated files verbatim into the plan/notes: `docker-compose.yml`, `management.json` (or `config.yaml`), `turnserver.conf`, `openid-configuration.json`, dashboard env. Also capture NetBird's **Caddy external-proxy template** (their docs ship one) — it shows the exact upstreams + HTTP/2/gRPC routing the dashboard/management/signal/relay need.
- [ ] **Step 4:** No commit (reference capture; informs Tasks 2–4).
---
### Task 2: `netbird_coordinator` service role — templates
**Files:** `roles/netbird_coordinator/` (scaffold via `make new-role NAME=netbird_coordinator`): `defaults/main.yml`, `tasks/main.yml`, `templates/{docker-compose.yml,management.json,turnserver.conf,openid-configuration.json,dashboard.env}.j2`, `handlers/main.yml`, `README.md`.
- [ ] **Step 1:** Translate the captured compose into `templates/docker-compose.yml.j2` — containers, the shared `boma` Docker network (so Caddy reaches them by name), **no host port mappings except what Caddy/Coturn need** (Coturn 3478/udp; everything else internal, Caddy fronts it). Pin image tags (ADR-011).
- [ ] **Step 2:** Translate `management.json`/`config.yaml` into a template — fill `Datadir`, `DataStoreEncryptionKey` (`{{ vault.netbird.datastore_key }}`), `HttpConfig` (public URL `https://netbird.askari.wingu.me`), `TURNConfig` (coturn host + `{{ vault.netbird.turn_password }}`), `Signal`, `Relay`, `Store` (sqlite), and the embedded-Dex IdP block (DeviceAuthorizationFlow/PKCE, `openid-configuration.json` URL).
- [ ] **Step 3:** `turnserver.conf.j2` (realm = `netbird.askari.wingu.me`, the TURN secret), `openid-configuration.json.j2`, `dashboard.env.j2` (`NETBIRD_MGMT_API_ENDPOINT=https://netbird.askari.wingu.me`, the `AUTH_*` Dex values).
- [ ] **Step 4:** `defaults/main.yml` (`netbird__*` knobs: version, base_dir `/opt/services/netbird`, domain) + `tasks/main.yml` (ADR-004 deploy mechanics: ensure dir, render all files, `community.docker.docker_compose_v2` up; `netbird__manage` toggle for Molecule).
- [ ] **Step 5:** `make lint`; commit `feat(netbird): coordinator service role (compose + config templates)`.
---
### Task 3: Secrets (CHANGEME convention + generated)
- [ ] **Step 1:** Add to vault (`make edit-vault`): `vault.netbird.datastore_key`, `vault.netbird.turn_password`, any Dex client secret — **generate** strong values (or stub `CHANGEME` + a comment if operator-supplied). Add `vault.netbird.setup_key: CHANGEME` with a comment "created in the NetBird dashboard after first boot — M5 enrolment".
- [ ] **Step 2:** `make check-vault` confirms structure + lists the `setup_key` placeholder.
- [ ] **Step 3:** Commit the vault.
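One way to generate the strong values before pasting them into `make edit-vault` is Python's `secrets` module. This is a minimal sketch: the helper name is hypothetical, and the key format (32 random bytes, base64-encoded) is an assumption that should be checked against what the captured `configure.sh` output uses for the pinned version.

```python
import base64
import secrets


def generate_netbird_secrets():
    """Return candidate values for the vault.netbird.* entries."""
    return {
        # Assumption: 32 random bytes, base64-encoded; confirm against Task 1's capture
        "datastore_key": base64.b64encode(secrets.token_bytes(32)).decode("ascii"),
        # URL-safe text keeps the value free of characters needing config escaping
        "turn_password": secrets.token_urlsafe(32),
        # Stub only: the real key is created in the dashboard after first boot (M5)
        "setup_key": "CHANGEME",
    }


generated = generate_netbird_secrets()
for name in generated:
    # Print names only, so the values never land in terminal scrollback
    print(name)
```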
---
### Task 4: Wire Caddy + DNS
- [ ] **Step 1:** Append to `reverse_proxy__routes` (`group_vars/all/reverse_proxy.yml`): `{host: netbird.askari.wingu.me, upstream: "<netbird container:port>"}` — per the captured Caddy template (NetBird needs HTTP/2 + gRPC; add the required Caddy directives, e.g. separate handles for the management gRPC path if the template shows them).
- [ ] **Step 2:** `netbird.askari.wingu.me` already resolves via the `*.askari.wingu.me` wildcard (M4a) — no new DNS record.
- [ ] **Step 3:** Commit.
---
### Task 5: Service-role standard files (ADR-004, authored)
- [ ] **Step 1:** Author `roles/netbird_coordinator/SECURITY.md` (copy `docs/security/service-security-template.md`; record the public surface = Caddy 443 + Coturn 3478, embedded-Dex auth, accepted-risk R3).
- [ ] **Step 2:** `VERIFY.md` (copy the template; the `/verify-service` UI spec — run later when the playwright harness exists).
- [ ] **Step 3:** `ACCESS.md` (ADR-021; the dashboard/admin access + `access__*` intent).
- [ ] **Step 4:** `BACKUP.md` (ADR-022; the **datastore is stateful** → `backup__*` data; record that off-site backup is **pending `fisi`** — an accepted risk for now).
- [ ] **Step 5:** `make lint`; commit `docs(netbird): service-role standard files (SECURITY/VERIFY/ACCESS/BACKUP)`.
---
### Task 6: Add netbird to the offsite playbook
- [ ] **Step 1:** In `playbooks/offsite.yml`, add `netbird_coordinator` after `reverse_proxy` (role-name tag). `make lint`. Commit.
---
### Task 7: Deploy to askari + verify (gated, live — expect debugging)
> NetBird self-hosting is finicky; budget for iterating on the management config + Caddy routing.
- [ ] **Step 1:** `make check PLAYBOOK=offsite LIMIT=askari TAGS=netbird` — review.
- [ ] **Step 2:** `make deploy PLAYBOOK=offsite LIMIT=askari TAGS=netbird` → `make deploy ... TAGS=reverse_proxy` (Caddy reloads with the netbird route).
- [ ] **Step 3:** Verify: `docker compose ps` all healthy; `curl -sI https://netbird.askari.wingu.me` → 200 with the M4a cert; the **dashboard loads** in a browser; the management API responds. Iterate on config/routing until green.
- [ ] **Step 4:** No repo commit (host state).
---
### Task 8: Docs
- [ ] **Step 1:** STATUS — `netbird_coordinator` built + applied (dashboard live); the first service role. ROADMAP M4b done; **M5 (enrol) next**. `make lint`; commit.
---
## Self-Review (completed)
- **Spec coverage:** external-proxy NetBird + embedded Dex (Decision 3) → Tasks 1,2,4; first service role + standard files (Decision 7) → Tasks 2,5; firewall 3478 (Decision 5) → done in M4a; setup key M5 + CHANGEME (Decision 8) → Task 3; Caddy front (M4a) → Task 4. Enrolment → M5, correct.
- **Placeholder scan:** the concrete config field *values* are intentionally captured from `configure.sh` (Task 1) rather than invented — version-sensitive, and inventing them would be wrong. The plan pins the method, not guesses.
- **Risk:** NetBird's external-proxy + gRPC routing is the hard part — Task 1 captures NetBird's own Caddy template to get it right, and Task 7 budgets for live iteration.


@ -0,0 +1,551 @@
# Public DNS (M1) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Build the `public_dns` role that manages `wingu.me`'s records at Gandi LiveDNS as code, purging Gandi's seeded defaults and applying boma's anti-spoof baseline.
**Architecture:** A control-node role drives `community.general.gandi_livedns` over declarative record lists in `group_vars/all/public_dns.yml` (mirroring the firewall-catalog pattern). Records to keep are `state: present`; Gandi's auto-seeded defaults are `state: absent`. A `public_dns__apply` toggle lets Molecule converge without calling the API; a pytest validates the data shape; the live run happens via `make check`/`deploy PLAYBOOK=dns` on ubongo.
**Tech Stack:** Ansible (`community.general.gandi_livedns`, PAT auth), pytest, Gandi LiveDNS API. Secrets from `vault.gandi.pat`.
**Spec:** `docs/superpowers/specs/2026-06-11-public-dns-gandi-migration-design.md`
**Execution context:** Tasks 1–6 + 8 are authoring (any machine with the venv). **Task 7 runs on ubongo** (has the vault + Gandi egress) and is the only one that touches live Gandi.
---
## File Structure
- `requirements.yml` (modify) — add `community.general` (≥9.0.0) for `gandi_livedns`.
- `roles/public_dns/` (create) — `defaults/main.yml`, `tasks/main.yml`, `meta/main.yml`, `README.md`, `molecule/default/`.
- `inventories/production/group_vars/all/public_dns.yml` (create) — `public_dns__domain` + `public_dns__records` (present) + `public_dns__absent` (Gandi defaults).
- `playbooks/dns.yml` (create) — control-node play running the role.
- `tests/test_public_dns.py` (create) — pytest over the record data.
- `docs/decisions/007-network.md`, `STATUS.md`, `docs/TODO.md`, `docs/CAPABILITIES.md` (modify) — doc reconciliation.
---
### Task 1: Add the `community.general` collection
**Files:**
- Modify: `requirements.yml`
- [ ] **Step 1: Add the collection with the on-demand comment**
In `requirements.yml`, under `collections:`, append:
```yaml
# community.general — gandi_livedns (public_dns role manages wingu.me at Gandi
# LiveDNS). PAT auth requires >= 9.0.0.
- name: community.general
  version: ">=9.0.0"
```
- [ ] **Step 2: Install it**
Run: `make collections`
Expected: installs `community.general` (≥9.0.0) with no errors.
- [ ] **Step 3: Verify the module is available**
Run: `.venv/bin/ansible-doc community.general.gandi_livedns | head -5`
Expected: prints the module doc header (confirms the module resolves), mentioning `personal_access_token`.
- [ ] **Step 4: Commit**
```bash
git add requirements.yml
git commit -m "deps: add community.general for gandi_livedns (public_dns)"
```
---
### Task 2: Scaffold the role
**Files:**
- Create: `roles/public_dns/` (via the scaffolder)
- [ ] **Step 1: Scaffold**
Run: `make new-role NAME=public_dns`
Expected: `Role public_dns scaffolded at roles/public_dns/` (creates `tasks/`, `handlers/`, `defaults/`, `meta/`, `templates/`, `files/`, `molecule/default/`, `README.md`).
- [ ] **Step 2: Commit the scaffold**
```bash
git add roles/public_dns
git commit -m "scaffold(public_dns): empty role structure"
```
---
### Task 3: Record data + validation test (TDD)
**Files:**
- Test: `tests/test_public_dns.py`
- Create: `inventories/production/group_vars/all/public_dns.yml`
- [ ] **Step 1: Write the failing test**
Create `tests/test_public_dns.py`:
```python
import pathlib

import yaml

_DATA = (
    pathlib.Path(__file__).resolve().parent.parent
    / "inventories" / "production" / "group_vars" / "all" / "public_dns.yml"
)

# Gandi auto-seeds these on a fresh .me zone; boma purges them (verified 2026-06-14).
GANDI_DEFAULTS_ABSENT = {
    ("@", "A"), ("www", "CNAME"), ("webmail", "CNAME"),
    ("gm1._domainkey", "CNAME"), ("gm2._domainkey", "CNAME"), ("gm3._domainkey", "CNAME"),
    ("_imap._tcp", "SRV"), ("_imaps._tcp", "SRV"), ("_pop3._tcp", "SRV"),
    ("_pop3s._tcp", "SRV"), ("_submission._tcp", "SRV"),
}


def _load():
    return yaml.safe_load(_DATA.read_text())


def test_domain_is_wingu():
    assert _load()["public_dns__domain"] == "wingu.me"


def test_present_records_well_formed():
    for r in _load()["public_dns__records"]:
        assert r["record"] and r["type"]
        assert isinstance(r["values"], list) and r["values"]


def test_anti_spoof_baseline_present():
    recs = {(r["record"], r["type"]): r["values"] for r in _load()["public_dns__records"]}
    assert recs[("@", "MX")] == ["0 ."]  # null MX
    assert recs[("@", "TXT")] == ['"v=spf1 -all"']  # SPF deny-all
    assert recs[("_dmarc", "TXT")] == ['"v=DMARC1; p=reject;"']


def test_gandi_defaults_marked_absent():
    absent = {(r["record"], r["type"]) for r in _load()["public_dns__absent"]}
    assert GANDI_DEFAULTS_ABSENT <= absent


def test_no_record_both_present_and_absent():
    present = {(r["record"], r["type"]) for r in _load()["public_dns__records"]}
    absent = {(r["record"], r["type"]) for r in _load()["public_dns__absent"]}
    assert present.isdisjoint(absent)


def test_no_duplicate_present_records():
    keys = [(r["record"], r["type"]) for r in _load()["public_dns__records"]]
    assert len(keys) == len(set(keys))
```
- [ ] **Step 2: Run it to verify it fails**
Run: `.venv/bin/python -m pytest tests/test_public_dns.py -v`
Expected: FAIL (the data file does not exist yet — `FileNotFoundError`).
- [ ] **Step 3: Create the record data**
Create `inventories/production/group_vars/all/public_dns.yml`:
```yaml
---
# Public DNS — wingu.me at Gandi LiveDNS, managed by the public_dns role (M1).
# Mesh/LAN-only by default: only deliberate public records live here. PAT in
# vault.gandi.pat. See docs/decisions/007-network.md and the M1 spec.
public_dns__domain: wingu.me
# Present — anti-spoof baseline for a no-mail domain (overwrites Gandi's seeded mail set).
public_dns__records:
  - { record: "@", type: MX, values: ["0 ."], ttl: 3600 }
  - { record: "@", type: TXT, values: ['"v=spf1 -all"'], ttl: 3600 }
  - { record: _dmarc, type: TXT, values: ['"v=DMARC1; p=reject;"'], ttl: 3600 }
# Service records appear as public-tier needs arise (askari A in M4).
# Mesh/LAN-only services never appear here.
# Absent — Gandi's auto-seeded defaults we don't want (purged once, idempotent thereafter).
public_dns__absent:
  - { record: "@", type: A }                  # Gandi parking IP
  - { record: www, type: CNAME }              # Gandi web-redirect
  - { record: webmail, type: CNAME }          # Gandi webmail
  - { record: gm1._domainkey, type: CNAME }   # Gandi DKIM
  - { record: gm2._domainkey, type: CNAME }
  - { record: gm3._domainkey, type: CNAME }
  - { record: _imap._tcp, type: SRV }         # Gandi mail autodiscovery
  - { record: _imaps._tcp, type: SRV }
  - { record: _pop3._tcp, type: SRV }
  - { record: _pop3s._tcp, type: SRV }
  - { record: _submission._tcp, type: SRV }
```
- [ ] **Step 4: Run the test to verify it passes**
Run: `.venv/bin/python -m pytest tests/test_public_dns.py -v`
Expected: PASS (6 passed).
- [ ] **Step 5: Commit**
```bash
git add tests/test_public_dns.py inventories/production/group_vars/all/public_dns.yml
git commit -m "feat(public_dns): wingu.me record data + validation test"
```
---
### Task 4: Role implementation (defaults, tasks, meta, README)
**Files:**
- Modify: `roles/public_dns/defaults/main.yml`
- Modify: `roles/public_dns/tasks/main.yml`
- Modify: `roles/public_dns/meta/main.yml`
- Modify: `roles/public_dns/README.md`
- [ ] **Step 1: Write `defaults/main.yml`**
```yaml
---
# public_dns — manage the public zone at Gandi LiveDNS as code (M1).
# Record data (public_dns__domain / __records / __absent) lives in group_vars/all.
# See docs/decisions/007-network.md.
public_dns__apply: true # set false to validate without calling the Gandi API (Molecule)
public_dns__default_ttl: 1800 # TTL when a record omits one
public_dns__domain: "" # overridden in group_vars/all
public_dns__records: [] # present records
public_dns__absent: [] # records to remove
```
- [ ] **Step 2: Write `tasks/main.yml`**
```yaml
---
- name: Assert public DNS data is sane
  ansible.builtin.assert:
    that:
      - public_dns__domain | length > 0
      - public_dns__records | selectattr('type', 'equalto', 'MX') | list | length > 0
    fail_msg: >-
      public_dns__domain must be set and a null-MX anti-spoof record declared in
      public_dns__records (group_vars/all/public_dns.yml).
  run_once: true

- name: Ensure desired records are present (Gandi LiveDNS)
  community.general.gandi_livedns:
    domain: "{{ public_dns__domain }}"
    record: "{{ item.record }}"
    type: "{{ item.type }}"
    # Bracket access: item.values would resolve to the dict's values() method.
    values: "{{ item['values'] }}"
    ttl: "{{ item.ttl | default(public_dns__default_ttl) }}"
    state: present
    personal_access_token: "{{ vault.gandi.pat }}"
  loop: "{{ public_dns__records }}"
  loop_control:
    label: "{{ item.record }} {{ item.type }}"
  run_once: true
  when: public_dns__apply | bool

- name: Ensure unwanted records are absent (Gandi LiveDNS)
  community.general.gandi_livedns:
    domain: "{{ public_dns__domain }}"
    record: "{{ item.record }}"
    type: "{{ item.type }}"
    state: absent
    personal_access_token: "{{ vault.gandi.pat }}"
  loop: "{{ public_dns__absent }}"
  loop_control:
    label: "{{ item.record }} {{ item.type }}"
  run_once: true
  when: public_dns__apply | bool
```
- [ ] **Step 3: Write `meta/main.yml`**
```yaml
---
galaxy_info:
  author: sjat
  description: Manage boma's public DNS zone (wingu.me) at Gandi LiveDNS as code.
  license: MIT
  min_ansible_version: "2.17"
  platforms:
    - name: Debian
      versions:
        - trixie
dependencies: []
```
- [ ] **Step 4: Write `README.md`**
```markdown
# public_dns
Manages boma's public DNS zone (**wingu.me**) at **Gandi LiveDNS** as code, via
`community.general.gandi_livedns` (PAT auth from `vault.gandi.pat`). Provider-agnostic
name on purpose. Run from the control node: `make check/deploy PLAYBOOK=dns`.
Mesh/LAN-only by default — only deliberate public records live in the zone (the
anti-spoof baseline now; `askari` in M4). Everything else is reached over LAN/mesh and
never appears here.
## Data (in `group_vars/all/public_dns.yml`)
| Var | Meaning |
|---|---|
| `public_dns__domain` | the zone (`wingu.me`) |
| `public_dns__records` | records to ensure **present** (`record`, `type`, `values`, optional `ttl`) |
| `public_dns__absent` | records to ensure **absent** (Gandi's auto-seeded defaults) |
## Behaviour knobs (`defaults/main.yml`)
| Var | Default | Meaning |
|---|---|---|
| `public_dns__apply` | `true` | set `false` to validate without calling the Gandi API (Molecule) |
| `public_dns__default_ttl` | `1800` | TTL when a record omits one |
## Notes
The zone is reconciled **additively** plus an explicit `absent` list (Gandi seeds 13
default records on a new `.me`; we purge the unwanted 11 and overwrite MX/SPF with the
anti-spoof baseline). Full-zone authoritative pruning is a future enhancement (TODO 8.3).
```
- [ ] **Step 5: Lint**
Run: `make lint`
Expected: `Passed: 0 failure(s)` and `check-tags: OK`.
- [ ] **Step 6: Commit**
```bash
git add roles/public_dns
git commit -m "feat(public_dns): role tasks, defaults, meta, README"
```
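The README's "additive plus explicit `absent` list" reconciliation can be pictured with a small in-memory model. This is a minimal sketch, not the role's implementation (the role delegates each record to `community.general.gandi_livedns`): the `reconcile` helper and the dict-based zone are hypothetical, and the sample zone values are placeholders. Only the record shape (`record`, `type`, `values`, optional `ttl`) and the 1800-second default TTL are taken from the plan.

```python
DEFAULT_TTL = 1800  # mirrors public_dns__default_ttl


def reconcile(zone, present, absent, default_ttl=DEFAULT_TTL):
    """Return a new zone after an additive reconcile (hypothetical helper).

    zone maps (record, type) -> {"values": [...], "ttl": int}.
    Records named in neither list are left untouched; that is what
    "additive" means here, as opposed to full-zone pruning.
    """
    result = {key: dict(value) for key, value in zone.items()}
    for rec in present:
        result[(rec["record"], rec["type"])] = {
            "values": list(rec["values"]),
            "ttl": rec.get("ttl", default_ttl),
        }
    for rec in absent:
        # Removing a record that is already gone is a no-op (idempotent).
        result.pop((rec["record"], rec["type"]), None)
    return result


# Placeholder zone: one seeded default to purge, one unrelated record to keep.
seeded = {
    ("www", "CNAME"): {"values": ["placeholder.example."], "ttl": 10800},
    ("keep", "A"): {"values": ["192.0.2.10"], "ttl": 300},
}
present = [
    {"record": "@", "type": "MX", "values": ["0 ."], "ttl": 3600},
    {"record": "@", "type": "TXT", "values": ['"v=spf1 -all"']},  # no ttl given
]
absent = [{"record": "www", "type": "CNAME"}]

after = reconcile(seeded, present, absent)
```

Running `reconcile` a second time on its own output returns an equal zone, which is the same property the live run's idempotence step checks against the real API.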
---
### Task 5: Molecule scenario (no live API)
**Files:**
- Modify: `roles/public_dns/molecule/default/converge.yml`
- Modify: `roles/public_dns/molecule/default/verify.yml`
- [ ] **Step 1: Write `converge.yml` (apply disabled, sample data)**
```yaml
---
- name: Converge
  hosts: all
  gather_facts: true
  vars:
    public_dns__apply: false  # never call the Gandi API from a container
    public_dns__domain: example.test
    public_dns__records:
      - { record: "@", type: MX, values: ["0 ."], ttl: 3600 }
      - { record: "@", type: TXT, values: ['"v=spf1 -all"'], ttl: 3600 }
    public_dns__absent:
      - { record: www, type: CNAME }
  roles:
    - role: public_dns
```
- [ ] **Step 2: Write `verify.yml`**
```yaml
---
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Role variables resolved
      ansible.builtin.assert:
        that:
          - public_dns__domain == "example.test"
          - public_dns__apply | bool == false
        msg: "public_dns defaults/vars did not resolve as expected"
      tags: [verify]
```
- [ ] **Step 3: Run Molecule**
Run: `make test ROLE=public_dns`
Expected: PASS — converge applies the role (the `assert` passes; the `gandi_livedns` tasks are skipped because `public_dns__apply: false`), verify passes, idempotence clean.
- [ ] **Step 4: Commit**
```bash
git add roles/public_dns/molecule
git commit -m "test(public_dns): Molecule scenario (apply disabled, no live API)"
```
---
### Task 6: The `dns.yml` playbook
**Files:**
- Create: `playbooks/dns.yml`
- [ ] **Step 1: Write the play**
```yaml
---
# dns.yml — manage the public DNS zone (wingu.me) at Gandi LiveDNS as code.
# Runs on the control node (ubongo) against the Gandi API — no host config.
# Run: make check PLAYBOOK=dns then make deploy PLAYBOOK=dns
- name: Manage public DNS (Gandi LiveDNS)
  hosts: control
  connection: local
  gather_facts: false
  become: false
  roles:
    - role: public_dns
      tags: [public_dns]
```
- [ ] **Step 2: Lint (verifies the role-name tag on the import)**
Run: `make lint`
Expected: `Passed: 0 failure(s)` and `check-tags: OK (... role imports verified)`.
- [ ] **Step 3: Commit**
```bash
git add playbooks/dns.yml
git commit -m "feat(public_dns): dns.yml play (control-node, Gandi LiveDNS)"
```
---
### Task 7: Live run on ubongo (purge + baseline) — gated
> **Runs on ubongo only** (vault + Gandi egress). `rbw unlock` first. This is the one
> task that mutates live Gandi; review the check-mode diff before deploying.
- [ ] **Step 1: Dry-run (check mode + diff)**
Run: `make check PLAYBOOK=dns`
Expected: the diff shows the 3 present records being set (null MX, SPF `-all`, DMARC `reject`) and the 11 Gandi defaults being removed. **Review it.**
- [ ] **Step 2: Apply**
Run: `make deploy PLAYBOOK=dns`
Expected: `changed` for the present + absent records; no errors.
- [ ] **Step 3: Verify idempotence**
Run: `make deploy PLAYBOOK=dns`
Expected: `ok=... changed=0` — a second run makes no changes.
- [ ] **Step 4: Verify with dig**
```bash
dig +short MX wingu.me # expect: 0 .
dig +short TXT wingu.me # expect: "v=spf1 -all"
dig +short TXT _dmarc.wingu.me # expect: "v=DMARC1; p=reject;"
dig +short www.wingu.me # expect: empty (CNAME removed)
```
Expected: as annotated (allow for TTL/propagation).
- [ ] **Step 5: No commit** — this task changes live Gandi, not the repo.
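The four `dig` checks in Step 4 amount to comparing observed answers with a fixed expectation table. A minimal sketch of that comparison follows; the `mismatches` helper and the `observed` mapping are hypothetical (collecting the answers, for example by running `dig +short`, is left out so the block stays offline), and the expected values are copied from the annotations in Step 4.

```python
EXPECTED = {
    ("wingu.me", "MX"): ["0 ."],
    ("wingu.me", "TXT"): ['"v=spf1 -all"'],
    ("_dmarc.wingu.me", "TXT"): ['"v=DMARC1; p=reject;"'],
    ("www.wingu.me", "A"): [],  # `dig +short www.wingu.me` queries A by default
}


def mismatches(observed, expected=EXPECTED):
    """Return {(name, rtype): (want, got)} for every answer that differs.

    A key missing from `observed` is treated as an empty answer.
    """
    diffs = {}
    for key, want in expected.items():
        got = sorted(observed.get(key, []))
        if got != sorted(want):
            diffs[key] = (want, got)
    return diffs
```

A non-empty result shortly after the apply may only reflect TTL or propagation delay, as Step 4 notes.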
---
### Task 8: Documentation reconciliation
**Files:**
- Modify: `docs/decisions/007-network.md`
- Modify: `STATUS.md`
- Modify: `docs/TODO.md`
- Modify: `docs/CAPABILITIES.md`
- [ ] **Step 1: Amend ADR-007 — naming scheme row**
Replace the `Public service FQDN` row of the naming-scheme table:
```
| Public service FQDN | `<service>.baobab.band` | `forgejo.nyumbani.baobab.band` |
```
with:
```
| Public service FQDN | `<service>.wingu.me` | `vaultwarden.wingu.me` |
| Off-site (VPS) FQDN | `<service>.askari.wingu.me` | `netbird.askari.wingu.me` |
```
- [ ] **Step 2: Amend ADR-007 — public zone + scheme**
Replace the **Public zone** paragraph:
```
**Public zone**: `baobab.band` — served by external DNS (Cloudflare or equivalent).
Public-facing services resolve to the public IP or Cloudflare proxy.
```
with:
```
**Public zone**: `wingu.me` — Gandi LiveDNS, **managed as code** by the `public_dns`
role (`vault.gandi.pat`). Three-tier naming: infra `<host>.boma.wingu.me` (internal),
services `<service>.wingu.me` (split-horizon), off-site `<service>.askari.wingu.me`.
`nyumbani` is retired. **Mesh/LAN-only by default**: home services have no public record
(reached over LAN or the NetBird mesh); only deliberate exceptions are published.
The project is `boma`; the domain is `wingu.me` (see the M1 spec). The legacy
`baobab.band` zone (Cloudflare) is out of scope here.
```
- [ ] **Step 3: Update the split-horizon example**
In the **Split-horizon** paragraph, replace the example `forgejo.nyumbani.baobab.band`
with `vaultwarden.wingu.me` (internal → private proxy IP; public → only if a deliberate
exception). Leave the internal-zone (`boma.baobab.band` → to become `boma.wingu.me` when
the `dns` role lands in Phase 2) wording; add a parenthetical: *(internal zone is renamed
to `boma.wingu.me` when the `dns` role is built — Phase 2)*.
- [ ] **Step 4: Mark STATUS — public_dns built**
In `STATUS.md`, under "Real and working today", add a row:
```
| `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); purged Gandi's seeded defaults, applied the anti-spoof baseline (null MX, SPF `-all`, DMARC reject). Mesh/LAN-only default. M1 of the roadmap. |
```
- [ ] **Step 5: Resolve TODO 4**
In `docs/TODO.md`, change item 4 to struck-through/decided:
```
4. ~~**Split-horizon FQDN** — adopt split-horizon FQDN with or without nyumbani?~~
DECIDED (M1): three-tier scheme on `wingu.me`; `nyumbani` dropped; mesh/LAN-only
default. See `docs/decisions/007-network.md` + the M1 spec.
```
- [ ] **Step 6: Add a CAPABILITIES row**
In `docs/CAPABILITIES.md`, near the Internal DNS row, add:
```
| Public DNS | `public_dns` role → Gandi LiveDNS | P | core | wingu.me zone as code (ADR-007) | anti-spoof baseline; mesh/LAN-only |
```
(Match the surrounding table's column shape; adjust the status letter to the table's convention.)
- [ ] **Step 7: Lint + commit**
Run: `make lint`
Expected: clean.
```bash
git add docs/decisions/007-network.md STATUS.md docs/TODO.md docs/CAPABILITIES.md
git commit -m "docs(public_dns): amend ADR-007 to wingu.me/Gandi; resolve TODO 4; STATUS + CAPABILITIES"
```
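The three-tier naming scheme introduced in Step 2 reduces to one pattern per tier. The sketch below is illustrative only: the `fqdn` helper and the tier keys (`infra`, `service`, `offsite`) are hypothetical names, while the patterns and sample hosts come from the amended ADR text. The infra pattern reflects the target `boma.wingu.me` zone, which Step 3 says is only renamed in Phase 2.

```python
PATTERNS = {
    "infra": "{name}.boma.{domain}",      # internal only
    "service": "{name}.{domain}",         # split-horizon
    "offsite": "{name}.askari.{domain}",  # VPS-hosted
}


def fqdn(tier, name, domain="wingu.me"):
    """Build an FQDN for one of the three tiers (hypothetical helper)."""
    try:
        pattern = PATTERNS[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None
    return pattern.format(name=name, domain=domain)
```

For example, `fqdn("service", "vaultwarden")` yields the service name used in the ADR table, and `fqdn("offsite", "netbird")` yields the coordinator's name.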
---
## Self-Review (completed)
- **Spec coverage:** role + group_vars data (Decisions 4,5) → Tasks 3,4; `gandi_livedns` + PAT (Decision 5, Verified facts) → Task 4; collections-on-demand (Decision 5) → Task 1; anti-spoof baseline + Gandi-defaults purge (Problem, Data model) → Tasks 3,7; cert scope (Decision 6) → out of scope (no cert tasks, correct); testing (check-mode/idempotence/dig + pytest) → Tasks 5,7,3; ADR-007 amendment + TODO 4/O12 → Task 8. All covered.
- **Placeholder scan:** none — every code/content step is concrete.
- **Type/name consistency:** `public_dns__domain`/`__records`/`__absent`/`__apply`/`__default_ttl` and `vault.gandi.pat` used identically across data, role, play, and tests. `gandi_livedns` params match the verified module signature.
- **Note for the implementer:** Task 7 assumes ubongo. If the `gandi_livedns` `absent` call needs `values` for some record types, add them from `public_dns__absent` (verify against the pinned `community.general` version per ADR-014).


@ -0,0 +1,234 @@
# M5 — Mesh enrollment (NetBird agents) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax.
**Goal:** `ubongo` reachable from anywhere over the NetBird mesh — enrol NetBird agents on `ubongo` + `askari` via a new opt-in `base` `mesh` concern; the operator enrols the laptops.
**Architecture:** A new `base` concern (`roles/base/tasks/mesh.yml`) installs a pinned NetBird agent and runs `netbird up` with a reusable scoped setup key from vault. Gated by `base__mesh_enabled` (per-host opt-in) and `base__mesh_manage` (skips network/daemon actions for Molecule). **No firewall change** — enrollment is additive (`wt0` comes up, SSH keeps listening), so there is zero lockout risk. The host nftables default-deny + NetBird ACL tightening are a separate, deferred follow-on.
**Tech Stack:** NetBird agent (apt, pinned), Ansible (`base` role), Molecule, the M4b coordinator at `https://netbird.askari.wingu.me`.
**Spec:** `docs/superpowers/specs/2026-06-17-m5-mesh-enrollment-design.md`
**Execution context:** Tasks 1-4 author + commit (need nothing from the operator). **Task 5 is an operator handoff** (dashboard `/setup` + mint key). **Task 6 applies live to `ubongo` + `askari`** (gated). Task 7 is operator-only (laptops). Task 8 docs.
---
## File structure
| File | Change | Responsibility |
|---|---|---|
| `tests/tags.yml` | modify | add the `mesh` concern to the closed tag vocabulary |
| `roles/base/defaults/main.yml` | modify | `base__mesh_*` knobs |
| `roles/base/tasks/mesh.yml` | **create** | the enrollment concern (install + `netbird up`) |
| `roles/base/tasks/main.yml` | modify | include `mesh.yml` (gated, tagged) |
| `roles/base/README.md` | modify | document the `mesh` concern + knobs |
| `roles/base/molecule/default/converge.yml` | modify | enable mesh (manage off) + dummy key |
| `roles/base/molecule/default/verify.yml` | modify | assert mesh wiring / no-op |
| `inventories/production/group_vars/control/vars.yml` | modify | `base__mesh_enabled: true` (ubongo) |
| `inventories/production/group_vars/offsite_hosts/vars.yml` | **create** | `base__mesh_enabled: true` (askari) |
| `inventories/production/group_vars/all/vault.yml` | modify (vault) | `vault.netbird.setup_key: CHANGEME` |
| `STATUS.md`, `docs/ROADMAP.md`, `docs/FRICTION.md` | modify | M5 done; deferred hardening; friction note |
---
### Task 1: Verify + pin the NetBird agent; add the `mesh` tag
- [ ] **Step 1 (ADR-014 verification — record the answers):** confirm against current NetBird docs/repo (WebFetch `docs.netbird.io`, `pkgs.netbird.io`):
- the **apt repo** URL + signing-key URL + suite/component (the install-script publishes an apt source — capture the exact `deb` line and key URL);
- the **package name** (headless agent — expected `netbird`) and that **version `0.72.4`** (matching the coordinator) is installable, plus the apt **version-pin syntax**;
- the exact **`netbird status`** output string that indicates an established management connection (for the idempotency guard — e.g. `Management: Connected`);
- the **`netbird up`** flags (`--management-url`, `--setup-key`);
- whether the pinned NetBird's **default peer policy is allow-by-default** (decides §Task 6 step 4). Record all of this in the commit message / a note block.
- [ ] **Step 2:** add `mesh` to `tests/tags.yml` under `concerns:`:
```yaml
- mesh # NetBird agent enrollment (ADR-016)
```
- [ ] **Step 3:** `make lint` → expect `check-tags: OK` (an unused vocab entry is allowed; nothing references it yet). Expected: 0 failures.
- [ ] **Step 4:** commit `feat(base): add the 'mesh' concern tag (NetBird agent, ADR-016)`.
---
### Task 2: `base` `mesh` concern — defaults + tasks + include + README
**Files:** `roles/base/defaults/main.yml`, `roles/base/tasks/mesh.yml` (create), `roles/base/tasks/main.yml`, `roles/base/README.md`.
- [ ] **Step 1:** append the knobs to `roles/base/defaults/main.yml`:
```yaml
# NetBird mesh agent enrollment (ADR-016). Opt-in: default off so applying `base` to a
# host not (yet) on the mesh is a no-op for this concern. The live actions (apt install
# over the network, `netbird up` against the coordinator) are additionally gated by
# base__mesh_manage so Molecule can exercise the wiring without a coordinator.
base__mesh_enabled: false
base__mesh_manage: true
base__mesh_management_url: "https://netbird.askari.wingu.me"
base__mesh_setup_key: "{{ vault.netbird.setup_key }}"  # carries the base__ prefix, so no var-naming lint suppression is needed
base__mesh_version: "0.72.4" # match the coordinator; confirmed installable in Task 1
```
- [ ] **Step 2:** create `roles/base/tasks/mesh.yml` (use the Task-1-verified repo URL/key/pin; the values below are the expected ones to confirm):
```yaml
---
# NetBird agent enrollment (ADR-016). Additive only — no firewall change here.
- name: Ensure /etc/apt/keyrings exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: "0755"
  tags: [mesh]
- name: Add the NetBird APT GPG key
  ansible.builtin.get_url:
    url: https://pkgs.netbird.io/debian/public.key  # confirm in Task 1
    dest: /etc/apt/keyrings/netbird.asc
    mode: "0644"
  when: base__mesh_manage | bool
  tags: [mesh]
- name: Add the NetBird APT repository
  ansible.builtin.apt_repository:
    # Confirm the repo line in Task 1. The note lives here because a trailing
    # "# ..." inside the folded scalar below would become part of the repo string.
    repo: >-
      deb [signed-by=/etc/apt/keyrings/netbird.asc]
      https://pkgs.netbird.io/debian stable main
    filename: netbird
    state: present
  when: base__mesh_manage | bool
  tags: [mesh]
- name: Install the NetBird agent (pinned)
  ansible.builtin.apt:
    name: "netbird={{ base__mesh_version }}"  # confirm pin syntax in Task 1
    state: present
    update_cache: true
  when: base__mesh_manage | bool
  tags: [mesh]
- name: Check current NetBird connection status
  ansible.builtin.command: netbird status
  register: _netbird_status
  changed_when: false
  failed_when: false
  when: base__mesh_manage | bool
  tags: [mesh]
- name: Enrol this host in the mesh
  ansible.builtin.command: >-
    netbird up
    --management-url {{ base__mesh_management_url }}
    --setup-key {{ base__mesh_setup_key }}
  register: _netbird_up
  changed_when: _netbird_up.rc == 0
  when:
    - base__mesh_manage | bool
    - "'Management: Connected' not in (_netbird_status.stdout | default(''))"  # confirm string in Task 1
  no_log: true  # setup key is on the argv
  tags: [mesh]
```
- [ ] **Step 3:** in `roles/base/tasks/main.yml`, add the include (after the existing concerns), gated by `base__mesh_enabled`:
```yaml
- name: NetBird mesh enrollment
  ansible.builtin.include_tasks:
    file: mesh.yml
    apply:
      tags: [mesh]
  when: base__mesh_enabled | bool
  tags: [mesh]
```
- [ ] **Step 4:** document the concern in `roles/base/README.md` (purpose; the `base__mesh_*` knobs table; that it is additive/no-firewall; that the setup key comes from `vault.netbird.setup_key`; the `enabled`/`manage` gating).
- [ ] **Step 5:** `make lint` → 0 failures. Commit `feat(base): NetBird agent enrollment concern (mesh)`.
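The two gates and the idempotency guard in `mesh.yml` interact in a way that is easy to misread from the task list alone. The following is a minimal model of that list, not code from the repo: `plan_mesh_actions` and the action labels are hypothetical, and the `Management: Connected` marker is the plan's expected string, still to be confirmed in Task 1.

```python
CONNECTED_MARKER = "Management: Connected"  # assumption: confirm in Task 1


def plan_mesh_actions(enabled, manage, status_stdout=""):
    """Return the ordered actions the concern would take (hypothetical model).

    enabled       -> base__mesh_enabled (per-host opt-in, gates the include)
    manage        -> base__mesh_manage (gates network and daemon actions)
    status_stdout -> output of `netbird status`, empty if it could not run
    """
    if not enabled:
        return []  # the include in main.yml is skipped entirely
    actions = ["ensure_keyring_dir"]  # the only task without a manage gate
    if not manage:
        return actions
    actions += ["add_apt_key", "add_apt_repo", "install_agent", "check_status"]
    if CONNECTED_MARKER not in status_stdout:
        actions.append("netbird_up")  # enrol only when not already connected
    return actions
```

With `manage` off the model returns only the keyring-directory step, matching the task file: that task carries no `when: base__mesh_manage` condition, so it still runs under Molecule.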
---
### Task 3: Molecule coverage
**Files:** `roles/base/molecule/default/converge.yml`, `roles/base/molecule/default/verify.yml`.
> The concern is install + a daemon command needing a live coordinator, so the hermetic Molecule surface is thin (the known "render-only misses the real call" gotcha). Molecule proves: (a) enabling mesh with `manage: false` does not break the base converge and is idempotent; (b) `base__mesh_enabled: false` (the default, already exercised by the existing firewall test) is a clean no-op. Full install+enrol is proven live in Task 6.
- [ ] **Step 1:** in `converge.yml` add to `vars:`:
```yaml
base__mesh_enabled: true
base__mesh_manage: false # skip network/daemon actions
base__mesh_setup_key: "dummy-molecule-key"
```
- [ ] **Step 2:** in `verify.yml` add a task asserting the concern is a clean no-op under `manage: false`, i.e. `netbird` is NOT installed and `wt0` does not exist (since all live actions are gated off):
```yaml
- name: Confirm mesh manage=false did not install/enrol
  ansible.builtin.command: which netbird
  register: _nb
  changed_when: false
  failed_when: false
- name: Assert netbird absent under manage=false
  ansible.builtin.assert:
    that:
      - _nb.rc != 0
    fail_msg: "netbird should not be installed when base__mesh_manage is false"
```
- [ ] **Step 3:** `make test ROLE=base` → converge + idempotence + verify pass (`failed=0`). The existing firewall assertions still pass (mesh vars don't affect them).
- [ ] **Step 4:** commit `test(base): molecule coverage for the mesh concern (manage-off no-op)`.
---
### Task 4: Vault stub + per-host opt-in
- [ ] **Step 1 (vault — needs `rbw` unlocked):** `make decrypt FILE=inventories/production/group_vars/all/vault.yml`; add under `vault.netbird` (alongside `auth_secret`/`datastore_key`):
```yaml
# Reusable, scoped (group "boma-hosts"), expiring NetBird setup key. Mint it in the
# dashboard (Setup Keys) AFTER the first-boot /setup admin exists. Consumed by the
# base 'mesh' concern. CHANGEME until the operator supplies it via `make edit-vault`.
setup_key: CHANGEME
```
`make encrypt FILE=...`; `make check-vault` → confirms structure + lists the `setup_key` CHANGEME.
- [ ] **Step 2:** set the opt-in. In `inventories/production/group_vars/control/vars.yml` add `base__mesh_enabled: true` (ubongo). Create `inventories/production/group_vars/offsite_hosts/vars.yml`:
```yaml
---
# askari is a NetBird peer as well as the coordinator host (ADR-016).
base__mesh_enabled: true
```
- [ ] **Step 3:** `make lint` → 0 failures. Commit `feat(base): vault setup_key stub + enable mesh on ubongo + askari`.
---
### Task 5: Operator handoff — first-boot admin + setup key (GATED, operator does this)
> Nothing here is automatable — the agent cannot create a dashboard admin or mint a key.
- [ ] **Step 1 (operator):** browse `https://netbird.askari.wingu.me`, complete the one-time `/setup` to create the admin user, log in.
- [ ] **Step 2 (operator):** create a **reusable** setup key, **scoped** to auto-assign peers to a `boma-hosts` group, with an **expiry**. Copy the key value.
- [ ] **Step 3 (operator):** `make edit-vault` → replace `vault.netbird.setup_key`'s `CHANGEME` with the real key → `:wq` (re-encrypts) → `make check-vault` shows no outstanding CHANGEME. The key never enters the chat.
- [ ] **Step 4:** no repo commit beyond the (already-encrypted) vault file, whose on-disk structure is unchanged.
---
### Task 6: Enrol `ubongo` + `askari` (GATED, live — needs Task 5 done + `rbw` unlocked)
- [ ] **Step 1:** `make check PLAYBOOK=site LIMIT=askari TAGS=mesh` — review (askari is `ansible`-user managed; cleaner first target than the control node). Then `make deploy PLAYBOOK=site LIMIT=askari TAGS=mesh`.
- [ ] **Step 2:** verify on askari: `netbird status` shows `Management: Connected`; `ip link show wt0` exists. (Agent coexists with the coordinator container; it reaches the coordinator via the public URL.)
- [ ] **Step 3:** `make check PLAYBOOK=site LIMIT=ubongo TAGS=mesh` — review. Note: ubongo is managed as `sjat` with `become: true` (same path `dev_env` used via `playbooks/workstation.yml`); confirm `sjat` sudo works (the run will prompt/fail clearly if a become password is needed). Then `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=mesh`.
- [ ] **Step 4:** verify the mesh link from ubongo: `netbird status` shows `ubongo` connected and lists `askari` as a peer; ping askari's NetBird (`100.x`) address. If the pinned NetBird is NOT allow-by-default (Task 1, Step 1), add one minimal dashboard policy permitting the admin group → `ubongo` SSH (or temporarily the default policy) so Task 7 can connect.
- [ ] **Step 5:** no repo commit (host state).
---
### Task 7: Enrol the road-warrior clients → goal lands (operator)
- [ ] **Step 1 (operator):** install the NetBird client on `mamba` + the work laptop; log in via the dashboard (Dex SSO) so they join the mesh.
- [ ] **Step 2 (operator):** from a laptop (anywhere), `ssh sjat@<ubongo-netbird-ip>` (or the mesh hostname) — connection succeeds. **← the mobile-access goal lands here.**
- [ ] **Step 3:** confirm with the operator that remote access works end-to-end.
---
### Task 8: Docs
- [ ] **Step 1:** `STATUS.md` — move "NetBird agent enrollment in `base`" to **built + applied** (ubongo + askari enrolled; reachability achieved). Note the `mesh` concern + opt-in. ubongo row: mesh-enrolled (its other base concerns still pending). askari row: NetBird peer.
- [ ] **Step 2:** `docs/ROADMAP.md`: **M5 ✅ DONE**; Phase 1 (remote access) complete. Next: the **Procurement gate** (`/capacity-review` → buy cluster hardware). Record the deferred "mesh hardening" follow-on (ubongo nftables default-deny + NetBird ACL tightening + askari SSH→`wt0`).
- [ ] **Step 3:** `docs/FRICTION.md` — add a signal: a **docs-only commit still tripped the `rbw`-locked pre-commit guard** (2026-06-17), although the 2026-06-10 kaizen fix was meant to let docs-/config-only commits through without vault — the hook scoping or a blanket guard needs a look.
- [ ] **Step 4:** `make lint`; commit `docs: M5 done — Phase 1 remote access complete`.
---
## Self-Review (completed)
- **Spec coverage:** `mesh` concern (spec §1) → Tasks 1-3; vault stub (spec §2) → Task 4; ubongo+askari enrol (spec §3) → Tasks 4,6; laptops (spec §3) → Task 7; reachability via default policy (spec §4) → Task 6 step 4; deferred hardening (spec §6) → recorded in Task 8; operator handoff (spec) → Task 5. Testing (spec) → Task 3 (hermetic) + Task 6 (live). All covered.
- **Placeholder scan:** the "confirm in Task 1" markers are ADR-014 verification points executed in Task 1 (the repo URL/key/pin/status-string), not vague TODOs — Task 2's code carries the expected values to confirm, matching how M4a/M4b pinned versions in-plan.
- **Consistency:** `base__mesh_enabled` (opt-in) vs `base__mesh_manage` (test gate) used consistently across defaults, tasks, include, converge, and the no-op assertion; `vault.netbird.setup_key` matches between defaults, vault stub, and Task 5; `mesh` tag added (Task 1) before it is used (Task 2).
- **Risk:** the only live risk is Task 6 on the control node — mitigated because the `mesh` concern makes **no firewall change** (SSH stays open on all paths), askari is enrolled first as the lower-risk rehearsal, and the host nftables lockdown is explicitly out of scope.


@ -0,0 +1,466 @@
# Mesh-hardening 1/3 — askari SSH onto wt0 — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Make askari's SSH reachable only over the NetBird mesh (`wt0`) and close the WAN `:22` surface at both the host nftables layer and the Hetzner Cloud Firewall, without dropping askari's public services.
**Architecture:** Three enforcement layers: (1) sshd `ListenAddress` bound to the live `wt0` IP (fail-closed, `ip_nonlocal_bind` to beat the post-boot bind race); (2) the base role's catalog-driven nftables default-deny (SSH already restricted to `wt0` via `base__firewall_mgmt_interface`; add a `public` zone + askari service entries so 80/443/3478 survive); (3) Terraform drops the Hetzner Cloud Firewall WAN `:22` rule. Tasks 1-4 are code (subagent-driven, each Molecule/lint/plan-verified). Task 5 is the live, operator-supervised cutover on the real host.
**Tech Stack:** Ansible (role `base`, FQCN), nftables, Molecule on Debian 13, `ansible.posix.sysctl`, pytest (filter unit tests), Terraform (`hcloud` provider).
**Spec:** `docs/superpowers/specs/2026-06-17-mesh-hardening-askari-ssh-wt0-design.md`
**Conventions:** `make lint` and `make test ROLE=base` before each commit; `make check` before `make deploy`; `make tf-plan` before `make tf-apply`; never hand-edit the generated `offsite.yml`; rbw unlocked for commits touching ansible content.
---
### Task 1: base role — sshd `ListenAddress` on wt0 + `ip_nonlocal_bind` (fail-closed)
**Files:**
- Modify: `roles/base/defaults/main.yml`
- Modify: `roles/base/tasks/ssh.yml`
- Modify: `roles/base/templates/sshd_hardening.conf.j2`
- Modify: `roles/base/molecule/default/converge.yml` (fixture)
- Modify: `roles/base/molecule/default/verify.yml` (assertions = the test)
- [ ] **Step 1: Write the failing test (extend Molecule verify)**
In `roles/base/molecule/default/verify.yml`, add these tasks after the existing "Sshd drop-in present and config valid" block:
```yaml
- name: ListenAddress bound to the fixture mesh IP (mesh-only mode)
  ansible.builtin.command: grep -q '^ListenAddress 100.99.0.1$' /etc/ssh/sshd_config.d/10-boma.conf
  changed_when: false
- name: ip_nonlocal_bind sysctl drop-in is present
  # Accept both "key = 1" and "key=1"; the sysctl module may write either spacing.
  ansible.builtin.command: grep -Eq '^net\.ipv4\.ip_nonlocal_bind *= *1$' /etc/sysctl.d/60-boma-nonlocal-bind.conf
  changed_when: false
- name: ip_nonlocal_bind is live in this netns
  ansible.builtin.command: sysctl -n net.ipv4.ip_nonlocal_bind
  register: _nonlocal
  changed_when: false
  failed_when: _nonlocal.stdout | trim != '1'
```
- [ ] **Step 2: Add the fixture that drives it (Molecule converge)**
In `roles/base/molecule/default/converge.yml`, add to the `vars:` block (alongside the existing `base__mesh_*`):
```yaml
base__ssh_listen_mesh_only: true
base__ssh_listen_addr: "100.99.0.1" # fixture mesh IP (no wt0 in the container)
```
- [ ] **Step 3: Run the test to verify it fails**
Run: `make test ROLE=base`
Expected: FAIL — converge errors or verify fails (`ListenAddress` not rendered; sysctl drop-in absent), because the feature isn't implemented yet.
- [ ] **Step 4: Add the defaults**
In `roles/base/defaults/main.yml`, after the `base__ssh_authorised_keys: []` line (end of the hardening block), add:
```yaml
# SSH listen-on-mesh (mesh-hardening 1/3, ADR-016/021). Opt-in: when true, sshd binds
# ListenAddress to this host's mesh IP only (not the WAN). The IP comes from the live wt0
# fact (ansible_facts.wt0.ipv4.address); base__ssh_listen_addr overrides it. ip_nonlocal_bind
# lets sshd bind the mesh IP before wt0 exists at boot. Fails closed: the play asserts a
# non-empty address rather than silently listening on all interfaces.
base__ssh_listen_mesh_only: false
base__ssh_listen_addr: ""
```
- [ ] **Step 5: Resolve + assert + sysctl in `ssh.yml`**
In `roles/base/tasks/ssh.yml`, insert these tasks at the TOP of the file (before "Ensure openssh-server is installed"):
```yaml
- name: Resolve the sshd mesh listen address (override, else live wt0 fact)
  ansible.builtin.set_fact:
    base__ssh_listen_addr_resolved: >-
      {{ base__ssh_listen_addr
      or ansible_facts.get('wt0', {}).get('ipv4', {}).get('address', '') }}
  when: base__ssh_listen_mesh_only | bool
- name: Fail closed — refuse to render sshd without a known mesh address
  ansible.builtin.assert:
    that:
      - base__ssh_listen_addr_resolved | length > 0
    fail_msg: >-
      base__ssh_listen_mesh_only is true but no mesh address resolved (set
      base__ssh_listen_addr or ensure wt0 is up so its fact is gathered). Refusing to
      render sshd ListenAddress empty (which would listen on ALL interfaces).
  when: base__ssh_listen_mesh_only | bool
- name: Allow sshd to bind the mesh IP before wt0 exists at boot
  ansible.posix.sysctl:
    name: net.ipv4.ip_nonlocal_bind
    value: "1"
    sysctl_set: true
    state: present
    reload: true
    sysctl_file: /etc/sysctl.d/60-boma-nonlocal-bind.conf
  when: base__ssh_listen_mesh_only | bool
```
- [ ] **Step 6: Render the conditional `ListenAddress`**
In `roles/base/templates/sshd_hardening.conf.j2`, append after the existing `KbdInteractiveAuthentication no` line:
```jinja
{% if base__ssh_listen_mesh_only | bool %}
ListenAddress {{ base__ssh_listen_addr_resolved }}
{% endif %}
```
- [ ] **Step 7: Run the test to verify it passes**
Run: `make test ROLE=base`
Expected: PASS — converge succeeds; verify confirms `ListenAddress 100.99.0.1`, the sysctl drop-in, and the live value `1`.
> **Checkpoint (environmental):** if `make test` fails on the sysctl task because the Molecule container can't write `net.ipv4.ip_nonlocal_bind`, add `sysctls: {net.ipv4.ip_nonlocal_bind: "0"}` to the platform in `roles/base/molecule/default/molecule.yml` (pre-creates the namespaced sysctl so the task can set it), then re-run. Note the change in the commit.
- [ ] **Step 8: Lint**
Run: `make lint`
Expected: `Passed: 0 failure(s)` and `check-tags: OK`.
- [ ] **Step 9: Commit**
```bash
git add roles/base/defaults/main.yml roles/base/tasks/ssh.yml \
roles/base/templates/sshd_hardening.conf.j2 \
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): opt-in sshd ListenAddress on the mesh IP (fail-closed)

base__ssh_listen_mesh_only binds sshd to the live wt0 IP only, with
ip_nonlocal_bind to beat the post-boot bind race and a fail-closed assert so an
unresolved address never silently listens on all interfaces. Molecule covers
the render + sysctl. Mesh-hardening 1/3 (ADR-016/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
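The resolve-then-assert logic from Step 5 and the conditional render from Step 6 can be expressed compactly outside Ansible. This is a minimal sketch, assuming facts shaped like `ansible_facts` (a `wt0` key holding `ipv4.address`); `resolve_listen_addr` and `render_listen_line` are hypothetical helpers, not part of the role.

```python
def resolve_listen_addr(mesh_only, override, facts):
    """Pick the sshd listen address, failing closed (hypothetical helper).

    Returns None when mesh-only mode is off. In mesh-only mode the explicit
    override wins, then the live wt0 fact; an empty result raises instead of
    falling back to "listen everywhere".
    """
    if not mesh_only:
        return None
    addr = override or facts.get("wt0", {}).get("ipv4", {}).get("address", "")
    if not addr:
        raise ValueError("mesh-only SSH requested but no mesh address resolved")
    return addr


def render_listen_line(mesh_only, override, facts):
    """Mirror the template: emit a ListenAddress line only in mesh-only mode."""
    addr = resolve_listen_addr(mesh_only, override, facts)
    return f"ListenAddress {addr}\n" if addr else ""
```

The design point is the exception: an empty `ListenAddress` would make sshd listen on all interfaces, so refusing to render is safer than rendering a blank value.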
---
### Task 2: firewall catalog — `public` zone + askari's public services
**Files:**
- Modify: `inventories/production/group_vars/all/firewall.yml`
- Modify: `roles/base/molecule/default/converge.yml` (fixture: public-zone rule)
- Modify: `roles/base/molecule/default/verify.yml` (assert the 0.0.0.0/0 rule)
- Test: `tests/test_firewall_rules.py` (unit: a `public` zone resolves to `0.0.0.0/0`)
Rationale: `base__firewall_mgmt_interface` already accepts `:22` on `wt0`. The gap is that the catalog is empty and has no "anywhere" source, so applying default-deny to askari would drop 80/443/3478. We add a `public` zone (`0.0.0.0/0`) and askari's service ingress.
- [ ] **Step 1: Write the failing unit test**
In `tests/test_firewall_rules.py`, add:
```python
def test_public_zone_resolves_to_anywhere():
    catalog = {"web": {"host": "askari",
                       "ingress": [{"from": "public", "port": 443, "proto": "tcp"}]}}
    zones = {"public": "0.0.0.0/0"}
    rules = rs.resolve_firewall_rules(catalog, zones, "askari",
                                      {"askari": {"ansible_host": "100.99.226.39"}}, {})
    assert rules == [{"proto": "tcp", "port": 443, "sources": ["0.0.0.0/0"]}]
```
(Module is loaded by the existing importlib shim at the top of the test file as `rs`. If the filter is imported under a different alias there, match it.)
- [ ] **Step 2: Run it to verify it fails (or passes trivially)**
Run: `.venv/bin/python -m pytest tests/test_firewall_rules.py -q`
Expected: this test PASSES immediately if the filter already resolves arbitrary zones (it does — `_resolve_source` treats any `zones` key generically). That is fine: the unit test documents/locks the `public`-zone contract. If it fails, fix the filter. Either way it must end green.
- [ ] **Step 3: Add the Molecule fixture (public-zone rule)**
In `roles/base/molecule/default/converge.yml`, under `firewall_zones:` add `public: 0.0.0.0/0`, and under `firewall_catalog:` add:
```yaml
netbird_stun:
  host: instance
  ingress:
    - { from: public, port: 3478, proto: udp }
```
- [ ] **Step 4: Add the Molecule assertion (the test)**
In `roles/base/molecule/default/verify.yml`, after the photoprism assertion block, add:
```yaml
- name: Assert the public->stun:3478/udp ingress rule (0.0.0.0/0 source)
  ansible.builtin.assert:
    that:
      - "'0.0.0.0/0' in nft"
      - "'udp dport 3478 accept' in nft"
    fail_msg: "missing public->3478/udp rule for netbird_stun"
```
- [ ] **Step 5: Run the tests**
Run: `make test ROLE=base` then `.venv/bin/python -m pytest tests/test_firewall_rules.py -q`
Expected: both PASS (the rendered ruleset now contains the `0.0.0.0/0 ... udp dport 3478 accept` rule).
- [ ] **Step 6: Populate the real catalog**
In `inventories/production/group_vars/all/firewall.yml`, replace the `firewall_zones`/`firewall_catalog` blocks with:
```yaml
# Zone → subnet (from ADR-007). `public` = the WAN (anywhere) for deliberately public
# off-site services (askari); home/cluster services use the internal zones only.
firewall_zones:
  mgmt: 10.10.0.0/24
  srv: 10.20.0.0/24
  lan: 10.30.0.0/24
  iot: 10.40.0.0/24
  guest: 10.50.0.0/24
  public: 0.0.0.0/0
# Service catalog: <name> → placement (host | group | hosts) + ingress[].
# askari's public surface (ADR-024 Caddy + ADR-016 NetBird STUN). NOTE: the host
# nftables template renders IPv4 source rules only; askari is reached via its A record
# (no AAAA), so IPv4-only public rules are sufficient (see the spec's IPv6 note).
firewall_catalog:
  reverse_proxy:
    host: askari
    ingress:
      - { from: public, port: 80, proto: tcp }
      - { from: public, port: 443, proto: tcp }
  netbird_stun:
    host: askari
    ingress:
      - { from: public, port: 3478, proto: udp }
```
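To make the zone-to-rule mapping concrete, here is a minimal Python sketch of how a catalog like the one above could expand into per-host ingress tuples. It is an illustration only: `ingress_rules_for` and the two module-level dictionaries are hypothetical stand-ins, and the role's real filter plugin (the one `tests/test_firewall_rules.py` exercises) may resolve placement and zones differently.

```python
import ipaddress

# Hypothetical copies of the inventory variables, trimmed to what askari needs.
FIREWALL_ZONES = {
    "mgmt": "10.10.0.0/24",
    "public": "0.0.0.0/0",
}
FIREWALL_CATALOG = {
    "reverse_proxy": {
        "host": "askari",
        "ingress": [
            {"from": "public", "port": 80, "proto": "tcp"},
            {"from": "public", "port": 443, "proto": "tcp"},
        ],
    },
    "netbird_stun": {
        "host": "askari",
        "ingress": [{"from": "public", "port": 3478, "proto": "udp"}],
    },
}


def ingress_rules_for(host, zones, catalog):
    """Return sorted (source_cidr, proto, port) tuples for services placed on `host`."""
    rules = []
    for service in catalog.values():
        if service.get("host") != host:
            continue
        for entry in service["ingress"]:
            # An unknown zone raises KeyError instead of silently dropping the rule.
            cidr = zones[entry["from"]]
            ipaddress.ip_network(cidr)  # raises ValueError on a malformed CIDR
            rules.append((cidr, entry["proto"], entry["port"]))
    return sorted(rules)
```

Calling `ingress_rules_for("askari", FIREWALL_ZONES, FIREWALL_CATALOG)` lists the three public rules (80/tcp, 443/tcp, 3478/udp) that must survive the default-deny; a host with no catalog entries gets an empty list.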
- [ ] **Step 7: Lint**
Run: `make lint`
Expected: clean pass (`check-tags: OK`).
- [ ] **Step 8: Commit**
```bash
git add inventories/production/group_vars/all/firewall.yml \
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
tests/test_firewall_rules.py
git commit -m "feat(firewall): public zone + askari's public services in the catalog

Adds a public (0.0.0.0/0) zone and askari's Caddy (80/443) + NetBird STUN
(3478/udp) ingress so the base nftables default-deny does not drop the live
public services when applied to askari. Molecule + filter unit test cover the
public-zone rendering. Mesh-hardening 1/3 (ADR-020/024/016).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 3: inventory — point Ansible at wt0 + enable mesh-only SSH on askari
**Files:**
- Create: `inventories/production/host_vars/askari.yml`
- Modify: `inventories/production/group_vars/offsite_hosts/vars.yml`
- [ ] **Step 1: Create the host_var override**
Create `inventories/production/host_vars/askari.yml`:
```yaml
---
# Manage askari over the NetBird mesh (wt0), not its WAN IP. This OVERRIDES the
# TF-generated inventories/production/offsite.yml (ansible_host = 77.42.120.136); host_vars
# outrank the generated inventory and are NOT touched by `make tf-inventory-offsite`.
# Mesh-hardening 1/3 — once SSH is wt0-only, the WAN IP is no longer reachable for SSH.
ansible_host: 100.99.226.39 # askari's wt0 address (NetBird, M5)
```
- [ ] **Step 2: Enable mesh-only SSH for offsite hosts**
In `inventories/production/group_vars/offsite_hosts/vars.yml`, replace the file body with:
```yaml
---
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
# (ADR-016, M5). Mesh-hardening 1/3 (2026-06-17): SSH is moved onto wt0 — sshd binds the
# mesh IP only (base__ssh_listen_mesh_only) and the base nftables default-deny applies
# (base__firewall_apply defaults true; SSH allowed on wt0 via base__firewall_mgmt_interface,
# public services via the catalog). base__mesh_enabled stays true (precondition from M5).
base__mesh_enabled: true
base__ssh_listen_mesh_only: true
```
- [ ] **Step 3: Verify the override resolves**
Run: `.venv/bin/ansible-inventory -i inventories/production/ --host askari 2>/dev/null | grep ansible_host`
Expected: `"ansible_host": "100.99.226.39"` (the host_var wins over the generated `offsite.yml`).
- [ ] **Step 4: Lint**
Run: `make lint`
Expected: clean pass.
- [ ] **Step 5: Commit**
```bash
git add inventories/production/host_vars/askari.yml \
inventories/production/group_vars/offsite_hosts/vars.yml
git commit -m "feat(inventory): manage askari over wt0 + enable mesh-only SSH

host_vars/askari.yml points ansible_host at the wt0 IP (overriding the generated
offsite.yml); offsite_hosts sets base__ssh_listen_mesh_only. Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 4: Terraform — retire the Hetzner WAN `:22` rule
**Files:**
- Modify: `terraform/modules/hetzner_vm/main.tf`
- Modify: `terraform/modules/hetzner_vm/variables.tf`
- Modify: `terraform/environments/offsite/main.tf`
This task makes the SSH rule conditional and sets askari's admin CIDRs to empty (mesh-only). The live `tf-plan`/`tf-apply` happens in Task 5 — here we only change + format/validate the code.
- [ ] **Step 1: Gate the SSH rule on a non-empty CIDR list**
In `terraform/modules/hetzner_vm/main.tf`, replace the static SSH `rule { ... }` block (the one with `port = "22"`) with a dynamic block:
```hcl
  # SSH from the control node only, and only when admin CIDRs are set. An empty
  # ssh_admin_cidrs removes the WAN :22 rule entirely (mesh-only SSH; reach the host over
  # wt0, break-glass = Hetzner console). Mesh-hardening 1/3.
  dynamic "rule" {
    for_each = length(var.ssh_admin_cidrs) > 0 ? [1] : []
    content {
      direction  = "in"
      protocol   = "tcp"
      port       = "22"
      source_ips = var.ssh_admin_cidrs
    }
  }
```
- [ ] **Step 2: Default the variable to empty**
In `terraform/modules/hetzner_vm/variables.tf`, change the `ssh_admin_cidrs` variable to default to an empty list:
```hcl
variable "ssh_admin_cidrs" {
  description = "Source CIDRs allowed to reach SSH over the WAN. Empty = no WAN SSH rule (mesh-only)."
  type        = list(string)
  default     = []
}
```
- [ ] **Step 3: Set askari to mesh-only SSH**
In `terraform/environments/offsite/main.tf`, change the `ssh_admin_cidrs` argument in the `module "askari"` block to:
```hcl
ssh_admin_cidrs = [] # mesh-only: SSH is reached over wt0; WAN :22 retired (mesh-hardening 1/3)
```
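The gate in Steps 1 to 3 reduces to a small truth table: zero CIDRs render zero SSH rules, and any non-empty list renders exactly one rule carrying all of them. The Python sketch below models only that `for_each` expression; `ssh_firewall_rules` is a hypothetical helper for reasoning about the plan output, not part of the Terraform module.

```python
def ssh_firewall_rules(ssh_admin_cidrs):
    """Model `for_each = length(var.ssh_admin_cidrs) > 0 ? [1] : []`."""
    if not ssh_admin_cidrs:
        # Empty list: the dynamic block iterates zero times, so no WAN :22 rule exists.
        return []
    # Non-empty list: one iteration, one rule, with every CIDR in source_ips.
    return [
        {
            "direction": "in",
            "protocol": "tcp",
            "port": "22",
            "source_ips": list(ssh_admin_cidrs),
        }
    ]
```

This is why the Task 5 plan should show a single rule being destroyed for askari, regardless of how many admin CIDRs were configured before.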
- [ ] **Step 4: Format + validate**
Run: `cd terraform/environments/offsite && terraform fmt -recursive ../.. && terraform validate && cd -`
Expected: `fmt` lists any reformatted files (re-add them); `validate` prints `Success! The configuration is valid.` (offsite is already `init`ed — it has live state.)
- [ ] **Step 5: Commit**
```bash
git add terraform/modules/hetzner_vm/main.tf terraform/modules/hetzner_vm/variables.tf \
terraform/environments/offsite/main.tf
git commit -m "feat(tf/offsite): retire askari's WAN :22 (mesh-only SSH)

The Hetzner Cloud Firewall SSH rule is now conditional on a non-empty
ssh_admin_cidrs (default []); askari sets it empty so the WAN :22 rule is
removed on the next apply. SSH is reached over wt0; break-glass is the Hetzner
console. Apply is the live cutover (Task 5). Mesh-hardening 1/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 5: Live staged cutover (operator-supervised — NOT a subagent task)
> This task touches the real askari over the network and is lockout-risky. Run it
> interactively with the operator, in order, verifying each step before the next. The
> firewall's auto-rollback timer + `wait_for_connection` over wt0 is the safety net; the
> Hetzner web console is the ultimate break-glass. Do NOT hand this to an unattended agent.
- [ ] **Step 1: Pre-check the mesh SSH path (before any change)**
Run: `.venv/bin/ansible askari -i inventories/production/ -m ping`
Expected: `SUCCESS`, confirming that Ansible reaches askari over `wt0` (Tasks 1-3 are merged, so `ansible_host` is now `100.99.226.39`). If this fails, STOP: the mesh path must work before closing the WAN.
- [ ] **Step 2: Dry-run the base apply (firewall + sshd)**
Run: `make check PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening`
Expected: shows the nftables ruleset diff (default-deny + wt0 SSH + public 80/443/3478) and the sshd drop-in diff (`ListenAddress 100.99.226.39`); no errors. Review that the public service rules are present (so they won't be dropped).
- [ ] **Step 3: Apply the host firewall + sshd (auto-rollback armed)**
Run: `make deploy PLAYBOOK=site LIMIT=askari TAGS=firewall,hardening`
Expected: the firewall concern arms the rollback timer, applies, resets the connection, and `wait_for_connection` succeeds over wt0; sshd reloads with the mesh ListenAddress. If connectivity is lost, the timer auto-reverts the ruleset within `base__firewall_rollback_timeout` (45 s).
- [ ] **Step 4: Verify services + WAN SSH still open at the cloud edge**
```bash
curl -sSf -o /dev/null -w '%{http_code}\n' https://test.askari.wingu.me # expect 200
curl -sSf -o /dev/null -w '%{http_code}\n' https://netbird.askari.wingu.me # expect 200
```
Expected: both `200` (valid certs); the host firewall did not drop the public services. (WAN `:22` is now dropped by the host nftables, but the Hetzner FW still allows it until Step 5 — that's fine.)
- [ ] **Step 5: Retire the Hetzner WAN `:22` — plan, review, apply**
Run: `make tf-plan TF_ENV=offsite`
Expected: the plan shows the SSH firewall rule being **destroyed** (and nothing else of substance). Review it.
Then: `make tf-apply TF_ENV=offsite`
Expected: apply succeeds; the WAN `:22` rule is gone.
- [ ] **Step 6: Verify the end-state (out-of-band)**
From an OFF-MESH host (e.g. the operator's laptop with NetBird disconnected, or a quick check from askari's perspective):
```bash
nc -vz -w5 77.42.120.136 22 # expect: refused / timeout (WAN SSH closed)
nc -vz -w5 77.42.120.136 443 # expect: open (public service intact)
```
And from ubongo over the mesh: `.venv/bin/ansible askari -i inventories/production/ -m ping` → `SUCCESS`.
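If `nc` is not available on the off-mesh host, the same TCP checks can be scripted. This is a minimal sketch using only the Python standard library; `tcp_port_open` is a hypothetical helper name, and like `nc` it cannot distinguish a refused connection from a silently dropped one beyond the time it takes to fail.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connect to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks alike.
        return False
```

From an off-mesh host, `tcp_port_open("77.42.120.136", 22)` should come back false once the cutover is complete, while port 443 should still come back true.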
- [ ] **Step 7: Reboot resilience check (optional but recommended)**
Reboot askari from the Hetzner console; after it comes back, confirm `ansible askari -m ping` succeeds over wt0 without intervention (proves `ip_nonlocal_bind` beat the post-boot bind race).
- [ ] **Step 8: Update STATUS + ROADMAP**
- In `STATUS.md`, update the askari row: SSH is now wt0-only; the host nftables default-deny is applied; the Hetzner WAN `:22` is retired. Move "host firewall + moving askari's SSH onto wt0" out of *Pending*.
- In `docs/ROADMAP.md`, mark mesh-hardening sub-project 1 (askari SSH→wt0) done; next is sub-project 2 (ubongo default-deny).
```bash
git add STATUS.md docs/ROADMAP.md
git commit -m "docs: askari SSH moved onto wt0 (mesh-hardening 1/3 done)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
- [ ] **Step 9: Push**
Run: `git push origin main`
---
## Self-review (against the spec)
- **§ three layers** → Task 1 (sshd ListenAddress), Task 2 (nftables catalog; SSH-on-wt0 pre-existing via `base__firewall_mgmt_interface`), Task 4 (Hetzner WAN :22). ✓
- **§ boot-race fix** (`ip_nonlocal_bind` + fail-closed assert + live wt0 fact) → Task 1 Steps 4-6. ✓
- **§ new code/vars** (`base__ssh_listen_mesh_only`, `base__ssh_listen_addr`, host_vars/askari.yml, offsite flag, catalog, TF) → Tasks 1-4. ✓
- **§ staged cutover** → Task 5 Steps 1-6, with the firewall auto-rollback as the gate. ✓
- **§ testing** → Molecule render asserts (ListenAddress, sysctl, public-zone rule) + filter unit test + live out-of-band checks. The fail-closed assert is exercised by code; to spot-check it, temporarily blank `base__ssh_listen_addr` in the converge fixture and confirm `make test ROLE=base` fails on the assert, then revert (manual, not automated — a deliberate-failure Molecule scenario is non-idiomatic). ✓
- **§ risks/rollback** → auto-rollback timer (Task 5 Step 3), `ip_nonlocal_bind` (Task 1), Hetzner console break-glass, re-addable TF rule. ✓
- **IPv6 note** → recorded in the catalog comment (Task 2 Step 6); acceptable because askari has only an A record.


@ -0,0 +1,409 @@
# Mesh-hardening redesign (askari) — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Harden askari's inbound surface with the proven ubongo INPUT-only default-deny pattern (SSH scoped by `iifname "wt0"` + a permanent WAN break-glass), and make the NetBird coordinator survive a no-egress startup — reboot-safe, no boot-race, no lockout.
**Architecture:** Mirror mesh-hardening 2/3 (ubongo): `base` firewall INPUT-only (`base__firewall_input_only: true`, forward stays `policy accept` so Docker forwarding/NAT survive), **no** sshd `ListenAddress` change (the firewall, not sshd, scopes `:22`). The coordinator-host exception: WAN `:22` stays open from ubongo's static WAN IP as the always-available non-mesh break-glass (the Hetzner console is the ultimate fallback). A `netbird_coordinator` change disables geolocation so a transient egress loss can't FATAL the control plane. Validate firewall reboot-safety on a throwaway VM (ADR-025 harness) GREEN before a supervised live cutover.
**Tech Stack:** Ansible (`base`, `netbird_coordinator` roles), nftables, Docker Compose, Molecule (Debian 13), the `scripts/integration-vm.py` ADR-025 harness, NetBird self-hosted `netbird-server:0.72.4`.
**Spec:** `docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md`
## Global Constraints
- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
- **No sshd `ListenAddress` change:** `base__ssh_listen_mesh_only` stays `false` everywhere here (this is what sidesteps the 2026-06-17 boot-race).
- **WAN `:22` is never closed** — no Terraform / Hetzner-Cloud-Firewall change in this plan.
- **`base__firewall_input_only: true` on askari** — the forward chain must stay `policy accept` (Docker host). Never apply a forward-`drop` firewall to askari.
- **ubongo's WAN IP is `91.226.145.80`** (operator-confirmed static 2026-06-19) — the break-glass anchor.
- **askari `wt0` IP is `100.99.226.39`**; askari domain `netbird.askari.wingu.me`.
- **Before any commit:** `rbw unlocked` must succeed (the pre-commit hook decrypts `vault.yml`); run `make lint` and it must be clean.
- **Tags:** import each role at play level with its role-name tag; only use concern tags from `tests/tags.yml`.
- **Harness GREEN before live** (Task 3 before Task 4). The live cutover (Task 4) is **operator-gated** — never run autonomously.
---
### Task 1: Disable geolocation in `netbird_coordinator` (FRICTION 2026-06-17 #4)
Make the control plane survive a startup with no container egress: NetBird's combined server downloads the GeoLite2 DB at boot and treats failure as FATAL. boma uses no geo posture (ACL is Allow-All), so disable geolocation entirely via the documented env var. TDD'd through the role's render-only Molecule scenario.
> verified: NetBird self-hosted geolocation knobs (`NB_DISABLE_GEOLOCATION`, `disableGeoliteUpdate`, GeoLite2 pre-seed) · WebFetch · docs.netbird.io/selfhosted/geo-support · 2026-06-19 — *from a docs summary; the live "healthy with egress blocked" check in Task 4 is the real gate, with a concrete pre-seed fallback there.*
**Files:**
- Modify: `roles/netbird_coordinator/defaults/main.yml` (add the knob)
- Modify: `roles/netbird_coordinator/templates/docker-compose.yml.j2:14-27` (add `environment:` to `netbird-server`)
- Test: `roles/netbird_coordinator/molecule/default/verify.yml:21-32` (assert the rendered compose)
- Modify: `roles/netbird_coordinator/README.md` (one line documenting the knob)
**Interfaces:**
- Produces: role default `netbird_coordinator__disable_geolocation` (bool, default `true`); rendered compose env `NB_DISABLE_GEOLOCATION: "true"` on the `netbird-server` service.
- [ ] **Step 1: Write the failing Molecule assertion**
Append to `roles/netbird_coordinator/molecule/default/verify.yml` (after the existing compose-tags assert, inside the same `tasks:` list):
```yaml
    # The name is quoted: in an unquoted YAML scalar, " #4" would start a comment
    # and silently truncate the task name.
    - name: "Assert geolocation is disabled (FRICTION 2026-06-17 #4, no geo-DB download FATAL)"
      ansible.builtin.assert:
        that:
          - "'NB_DISABLE_GEOLOCATION: \"true\"' in (_compose.content | b64decode)"
        fail_msg: >-
          compose must set NB_DISABLE_GEOLOCATION=true so a no-egress startup can't FATAL
          the coordinator on the GeoLite2 download
        success_msg: "geolocation disabled in compose"
```
- [ ] **Step 2: Run Molecule to verify it fails**
Run: `make test ROLE=netbird_coordinator`
Expected: FAIL at "Assert geolocation is disabled" — the rendered compose has no `NB_DISABLE_GEOLOCATION`.
- [ ] **Step 3: Add the default knob**
Add to `roles/netbird_coordinator/defaults/main.yml` (after line 7, the `__domain` line):
```yaml
# Disable NetBird's GeoLite2 geolocation (download + lookups). boma uses no geo posture
# (ACL is Allow-All), and the combined server treats a failed GeoLite2 download as FATAL —
# so a transient egress loss (NAT wiped on `nft flush`, or the boot window before Docker
# re-adds NAT) would crash-loop the whole control plane (FRICTION 2026-06-17 #4). Disabling
# removes that dependency. Revisit if a future ACL sub-project wants geo-based posture.
netbird_coordinator__disable_geolocation: true
```
- [ ] **Step 4: Render the env in the compose template**
In `roles/netbird_coordinator/templates/docker-compose.yml.j2`, add an `environment:` block to the `netbird-server` service, immediately after its `command:` line (line 18):
```yaml
    environment:
      # Disable geolocation so a no-egress startup can't FATAL the control plane
      # (FRICTION 2026-06-17 #4). boma uses no geo posture (ACL Allow-All).
      NB_DISABLE_GEOLOCATION: "{{ netbird_coordinator__disable_geolocation | string | lower }}"
```
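The `| string | lower` chain matters because the Molecule assertion matches the literal text `NB_DISABLE_GEOLOCATION: "true"`, and a bare boolean would stringify as `True`. The Python sketch below approximates what the template line produces; `render_geolocation_env` is a hypothetical helper, not something the role ships.

```python
def render_geolocation_env(disable_geolocation):
    """Approximate the quoted template line with `{{ value | string | lower }}`."""
    # str(True) is "True"; lower() yields the "true" the assertion looks for.
    rendered = str(disable_geolocation).lower()
    return f'NB_DISABLE_GEOLOCATION: "{rendered}"'
```

With the default of `true` the rendered line is exactly the substring the Step 1 assertion searches for; overriding the default to `false` would make that assertion fail by design.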
- [ ] **Step 5: Run Molecule to verify it passes**
Run: `make test ROLE=netbird_coordinator`
Expected: PASS — all asserts green, including "geolocation disabled in compose"; Molecule idempotence clean.
- [ ] **Step 6: Document the knob**
Add one line to `roles/netbird_coordinator/README.md` under its variables/defaults section:
```markdown
- `netbird_coordinator__disable_geolocation` (default `true`) — sets `NB_DISABLE_GEOLOCATION` so a no-egress startup can't FATAL the server on the GeoLite2 download (FRICTION 2026-06-17 #4).
```
- [ ] **Step 7: Lint and commit**
```bash
rbw unlocked && make lint
git add roles/netbird_coordinator/defaults/main.yml \
roles/netbird_coordinator/templates/docker-compose.yml.j2 \
roles/netbird_coordinator/molecule/default/verify.yml \
roles/netbird_coordinator/README.md
git commit -m "feat(netbird_coordinator): disable geolocation so no-egress startup can't FATAL the control plane" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 2: Enable askari's host firewall (INPUT-only) + WAN break-glass + manage over `wt0`
Flip askari from "firewall not applied" to the redesigned INPUT-only default-deny, add the permanent WAN break-glass source, and point Ansible at the mesh. Pure inventory change — validated by lint + inventory resolution (the firewall *behavior* is proven in Task 3).
**Files:**
- Modify: `inventories/production/group_vars/offsite_hosts/vars.yml` (replace the whole file body)
- Create: `inventories/production/host_vars/askari.yml`
**Interfaces:**
- Consumes: `base` knobs `base__firewall_apply`, `base__firewall_input_only`, `base__firewall_admin_addrs`, `base__ssh_listen_mesh_only`, `base__mesh_enabled` (all defined in `roles/base/defaults/main.yml`).
- Produces: askari resolves `ansible_host: 100.99.226.39`, `base__firewall_apply: true`, `base__firewall_input_only: true`, `base__firewall_admin_addrs: ["91.226.145.80"]`.
- [ ] **Step 1: Rewrite the offsite group_vars**
Replace the body of `inventories/production/group_vars/offsite_hosts/vars.yml` with:
```yaml
---
# Off-site hosts (askari). askari runs the NetBird coordinator AND is a mesh peer
# (ADR-016, M5).
#
# Mesh-hardening REDESIGN (2026-06-19): the 2026-06-17 attempt was backed out (forward
# `policy drop` broke Docker on reboot; wt0-only sshd left no break-glass; ip_nonlocal_bind
# did not beat the boot-race). The redesign mirrors the proven ubongo 2/3 pattern:
# - INPUT-only default-deny (base__firewall_input_only) — forward stays `policy accept`
# so Docker container forwarding/NAT survive a reboot;
# - SSH scoped by the host firewall (iifname wt0 + admin-addr), NOT a sshd ListenAddress
# change — base__ssh_listen_mesh_only stays false, so there is no boot-race;
# - WAN :22 is DELIBERATELY left open from ubongo's WAN IP (base__firewall_admin_addrs)
# as the permanent non-mesh break-glass — the coordinator-host exception (a host's only
# management path must never depend on a service that host itself hosts).
# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
base__mesh_enabled: true
base__firewall_apply: true
base__firewall_input_only: true # forward stays `policy accept` → Docker-safe
base__ssh_listen_mesh_only: false # no sshd ListenAddress change → no boot-race
base__firewall_admin_addrs:
- 91.226.145.80 # ubongo's (static) WAN IP — the permanent non-mesh SSH break-glass
```
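The SSH admission policy these variables describe can be stated as a two-branch decision: accept on the mesh interface, otherwise accept only from a listed admin address. The following Python sketch models that decision for reasoning purposes; `ssh_allowed` is hypothetical, the interface name `eth0` in the usage note is an assumption, and the real `nftables.conf.j2` template may add further sources.

```python
import ipaddress


def ssh_allowed(iifname, saddr, mgmt_interface="wt0", admin_addrs=("91.226.145.80",)):
    """Return True if an inbound :22 packet would match one of the SSH allow rules."""
    if iifname == mgmt_interface:
        # Mesh path: anything arriving on wt0 is accepted for SSH.
        return True
    source = ipaddress.ip_address(saddr)
    # Break-glass path: the source must fall inside a listed admin address or CIDR.
    return any(source in ipaddress.ip_network(addr) for addr in admin_addrs)
```

Under this model a packet from ubongo's WAN IP on a non-mesh interface (`ssh_allowed("eth0", "91.226.145.80")`) is admitted, while any other WAN source falls through to the input chain's `policy drop`.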
- [ ] **Step 2: Create the askari host_vars to manage over the mesh**
Create `inventories/production/host_vars/askari.yml`:
```yaml
---
# Manage askari over the NetBird mesh (wt0). Overrides the TF-generated WAN `ansible_host`
# in offsite.yml (host_vars are NOT regenerated by tf_to_inventory.py). The WAN :22 path
# (Hetzner Cloud Firewall + base__firewall_admin_addrs = ubongo's WAN) stays as the
# break-glass; the Hetzner web console is the IP-independent ultimate fallback.
# Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-askari-redesign-design.md
ansible_host: 100.99.226.39
```
- [ ] **Step 3: Verify the inventory resolves**
Run: `ansible-inventory -i inventories/production --host askari`
Expected: JSON shows `"ansible_host": "100.99.226.39"`, `"base__firewall_apply": true`, `"base__firewall_input_only": true`, and `"base__firewall_admin_addrs": ["91.226.145.80"]`.
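Eyeballing the JSON works, but the four expected values can also be checked mechanically. This is a small sketch that compares the `ansible-inventory --host` output against the values listed above; `inventory_mismatches` and `EXPECTED` are hypothetical names introduced for illustration.

```python
import json

# The values this step expects askari to resolve to.
EXPECTED = {
    "ansible_host": "100.99.226.39",
    "base__firewall_apply": True,
    "base__firewall_input_only": True,
    "base__firewall_admin_addrs": ["91.226.145.80"],
}


def inventory_mismatches(host_json, expected=EXPECTED):
    """Return {key: (expected, actual)} for every variable that does not match."""
    actual = json.loads(host_json)
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }
```

Feed it the command's stdout (for example via `subprocess.run(..., capture_output=True, text=True).stdout`); an empty dictionary means the override and the group_vars resolved as intended.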
- [ ] **Step 4: Lint**
Run: `rbw unlocked && make lint`
Expected: clean (no yamllint/ansible-lint errors).
- [ ] **Step 5: Commit**
```bash
git add inventories/production/group_vars/offsite_hosts/vars.yml \
inventories/production/host_vars/askari.yml
git commit -m "feat(inventory): askari INPUT-only firewall + WAN break-glass + manage over wt0" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 3: Integration harness "askari_inputonly" profile — the reboot-safety GREEN gate
Prove on a throwaway VM (ADR-025) that the redesigned firewall is reboot-safe BEFORE touching the real host: INPUT default-deny + forward accept + the admin-addr break-glass + published-port DNAT all survive a reboot. New profile (keeps the existing `askari` profile, which validates the `docker_host` container-forward drop-in path, intact).
**Files:**
- Create: `tests/integration/profiles/askari_inputonly.json`
- Create: `tests/integration/overrides/askari_inputonly.yml`
- Modify: `tests/integration/verify.yml` (allow-list + a new profile branch)
**Interfaces:**
- Consumes: the `scripts/integration-vm.py` harness; `make test-integration HOST=<profile>` maps `HOST` to `profiles/<HOST>.json` (a profile name, not a production inventory host).
- Produces: profile `askari_inputonly` with `integration_profile: askari_inputonly`.
- [ ] **Step 1: Add the new profile to the verify allow-list and a failing branch**
In `tests/integration/verify.yml`, change the allow-list assert (line 14) from:
```yaml
- integration_profile in ['askari', 'ubongo']
```
to:
```yaml
- integration_profile in ['askari', 'askari_inputonly', 'ubongo']
```
and update its `fail_msg` (line 15) to `"integration_profile must be set in the profile overlay (askari|askari_inputonly|ubongo)"`. Then append this block to the `tasks:` list (after the ubongo block):
```yaml
    # ── askari_inputonly profile: the mesh-hardening REDESIGN (2026-06-19) ──
    # INPUT-only default-deny on a Docker host: input policy drop, forward policy ACCEPT
    # (Docker-safe), SSH via the admin-addr break-glass, published-port DNAT survives reboot.
    - name: (askari_inputonly) Read the live nftables ruleset
      when: integration_profile == 'askari_inputonly'
      ansible.builtin.command: nft list ruleset
      register: _nft_io
      changed_when: false

    - name: (askari_inputonly) INPUT default-deny, forward permissive, admin-addr break-glass
      when: integration_profile == 'askari_inputonly'
      ansible.builtin.assert:
        that:
          - "'hook input priority filter; policy drop;' in _nft_io.stdout"
          - "'hook forward priority filter; policy accept;' in _nft_io.stdout"
          - "'ip saddr 192.168.150.1 tcp dport 22 accept' in _nft_io.stdout"
        fail_msg: >-
          askari_inputonly: expected input policy drop, forward policy accept (input-only),
          and the admin-addr break-glass (192.168.150.1) SSH allow in the live ruleset.

    - name: (askari_inputonly) Gather service facts
      when: integration_profile == 'askari_inputonly'
      ansible.builtin.service_facts:

    - name: (askari_inputonly) Docker daemon is active
      when: integration_profile == 'askari_inputonly'
      ansible.builtin.assert:
        that: "ansible_facts.services['docker.service'].state == 'running'"
        fail_msg: "docker.service is not running"

    - name: (askari_inputonly) Published port answers from the controller (DNAT + forward alive)
      when: integration_profile == 'askari_inputonly'
      delegate_to: localhost
      become: false
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}/"
        follow_redirects: none
        status_code: [200, 301, 308, 404, 502, 503]
        timeout: 10
      register: _probe_io
      retries: 5
      delay: 6
      until: _probe_io is succeeded
```
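The assertions above are substring matches over the whole ruleset, so a `policy` line from another table (Docker manages its own) could in principle satisfy them. When debugging a red run it can help to scope the check to the `inet filter` table. The Python sketch below does that with plain text parsing; `table_body` and `chain_policies` are hypothetical helpers, and the parsing assumes the usual `nft list ruleset` layout rather than any stable machine format.

```python
import re

# Matches e.g. "type filter hook input priority filter; policy drop;"
POLICY_RE = re.compile(r"hook (\w+) priority [^;]+; policy (\w+);")


def table_body(ruleset, family="inet", name="filter"):
    """Return the text of `table <family> <name> { ... }`, or "" if the table is absent."""
    start = ruleset.find(f"table {family} {name} {{")
    if start == -1:
        return ""
    depth = 0
    for index in range(start, len(ruleset)):
        if ruleset[index] == "{":
            depth += 1
        elif ruleset[index] == "}":
            depth -= 1
            if depth == 0:
                # Braces balanced: this closes the table itself.
                return ruleset[start:index]
    return ruleset[start:]


def chain_policies(ruleset):
    """Map each base-chain hook in the inet filter table to its default policy."""
    return dict(POLICY_RE.findall(table_body(ruleset)))
```

For the redesigned firewall the expected result is `{"input": "drop", "forward": "accept"}`, independent of whatever policies other tables declare.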
- [ ] **Step 2: Create the profile descriptor**
Create `tests/integration/profiles/askari_inputonly.json`:
```json
{
"groups": ["offsite_hosts"],
"applies": [
{"playbook": "site.yml", "tags": ["base"]},
{"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
],
"extra_vars_files": ["overrides/askari_inputonly.yml"],
"mem_mib": 3072,
"vcpus": 2
}
```
- [ ] **Step 3: Create the overlay**
Create `tests/integration/overrides/askari_inputonly.yml`:
```yaml
---
# Integration overlay (ADR-025) — the askari mesh-hardening REDESIGN (2026-06-19).
# Validates INPUT-only default-deny on a Docker host: input policy drop, forward policy
# accept (Docker-safe), SSH via the admin-addr break-glass, reboot-survivable.
integration_profile: askari_inputonly
base__firewall_apply: true
base__firewall_input_only: true
# No sshd ListenAddress change — never wt0-only in a throwaway VM.
base__ssh_listen_mesh_only: false
# Isolated VM: never touch the real mesh.
base__mesh_enabled: false
# The non-mesh SSH break-glass = the admin-addr path the real design uses. Point it at the
# VM's libvirt-NAT gateway (where the harness connects from), by source IP so it is
# interface-independent and the default-deny + reboot don't lock out the driver. This
# mirrors askari's real base__firewall_admin_addrs (ubongo's WAN) in the test topology.
base__firewall_admin_addrs:
- 192.168.150.1
```
- [ ] **Step 4: Run the harness — the GREEN gate**
Run: `make test-integration HOST=askari_inputonly`
Expected: GREEN. The harness boots a VM, applies `base` (INPUT-only) + `docker_host` + `reverse_proxy`, **reboots**, re-SSHes (proving the admin-addr break-glass survives), then `verify.yml` asserts input `policy drop`, forward `policy accept`, the `192.168.150.1` SSH allow, Docker active, and the published `:80` answering. Clean up: `make test-integration-clean`.
- [ ] **Step 5: Commit**
```bash
rbw unlocked && make lint
git add tests/integration/profiles/askari_inputonly.json \
tests/integration/overrides/askari_inputonly.yml \
tests/integration/verify.yml
git commit -m "test(integration): askari_inputonly profile — INPUT-only default-deny reboot gate" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 4: Supervised live cutover + STATUS/ROADMAP update — ⚠️ OPERATOR-GATED
> **⚠️ DO NOT run this task autonomously.** It changes the live off-site host (lockout risk) and runs `make deploy`. An automated executor must STOP here and hand back to the operator. Preconditions: Tasks 1-3 committed and GREEN; `rbw unlocked`; the **Hetzner web console** open in a browser (the out-of-band ultimate break-glass); the operator present. The WAN `:22` break-glass is never removed, so a fallback path is open throughout (FRICTION 2026-06-17 #6).
**Files (Step 7 only):**
- Modify: `STATUS.md` (askari row), `docs/ROADMAP.md` (Next step)
- [ ] **Step 1: Pre-check both paths are healthy**
```bash
ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
ansible askari -i inventories/production -m ping
curl -sI https://test.askari.wingu.me | head -1
curl -sI https://netbird.askari.wingu.me | head -1
```
Expected: wt0 SSH OK; ping `pong`; both curls `HTTP/2 200`.
- [ ] **Step 2: Dry-run the converge (mandatory `check` before `deploy`)**
```bash
make check PLAYBOOK=site LIMIT=askari
```
Expected: changes limited to the `base` firewall (input-only ruleset, admin-addr) + the `netbird_coordinator` compose env (`NB_DISABLE_GEOLOCATION`). Review and show the output before proceeding.
- [ ] **Step 3: Apply (operator present, console open, auto-rollback armed)**
```bash
make deploy PLAYBOOK=site LIMIT=askari
```
The `base` firewall concern arms the auto-rollback timer (`base__firewall_rollback_timeout: 45`) and reconnects over `wt0` — a bad ruleset reverts itself. Expected: converge OK; SSH-over-`wt0` stays up.
- [ ] **Step 4: Rebuild NAT and confirm the coordinator is healthy with geo disabled**
`base`'s `flush ruleset` wipes Docker's nat (FRICTION) — rebuild it, then confirm the control plane:
```bash
ssh sjat@100.99.226.39 'sudo systemctl restart docker'
ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
ssh sjat@100.99.226.39 'docker logs --since 2m netbird-server 2>&1 | grep -iE "geo|fatal" || echo "no geo/fatal log lines"'
```
Expected: `netbird-server` + `netbird-dashboard` Up; no geo-DB FATAL.
> **Contingency (only if `netbird-server` still FATALs on geolocation):** `NB_DISABLE_GEOLOCATION` was not honored by the pinned image. Pre-seed the DB into the volume instead — `ssh sjat@100.99.226.39 'sudo curl -fSL -o /var/lib/docker/volumes/netbird_data/_data/GeoLite2-City_20260101.mmdb https://pkgs.netbird.io/geolite2/GeoLite2-City.mmdb && sudo docker restart netbird-server'` — and add `disableGeoliteUpdate: true` under `server:` in `config.yaml.j2` so it never re-downloads. Re-verify, then fold the working fix back into the role (amend Task 1).
- [ ] **Step 5: Verify the new steady state (both SSH paths + services)**
```bash
ssh sjat@100.99.226.39 true && echo "wt0 SSH OK"
# From ubongo: SSH to askari's WAN IP. ubongo's packets egress via OPNsense, SNAT'd to the
# WAN IP 91.226.145.80 — matching askari's admin-addr break-glass rule. (No BindAddress:
# ubongo does not hold 91.226.145.80; OPNsense does.)
ssh sjat@77.42.120.136 true && echo "WAN break-glass OK"
curl -sI https://test.askari.wingu.me | head -1
nc -vz -u 77.42.120.136 3478 # best-effort only: a UDP scan cannot prove that STUN answered
```
Expected: both SSH paths succeed; cert valid; STUN reachable.
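Because UDP has no handshake, a port scan cannot show that STUN is actually answering. A stronger check is to send a real Binding request and validate the reply. The sketch below follows the STUN message header layout (type, length, magic cookie `0x2112A442`, 12-byte transaction ID); the helper names are hypothetical, and it assumes the coordinator answers unauthenticated Binding requests, which has not been verified against the pinned NetBird image.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101


def build_binding_request(transaction_id=None):
    """Return (packet, transaction_id) for an attribute-less STUN Binding request."""
    tid = transaction_id or os.urandom(12)
    # Header: message type, message length (0: no attributes), magic cookie.
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + tid, tid


def is_binding_success(packet, transaction_id):
    """True if `packet` is a Binding success response echoing our transaction ID."""
    if len(packet) < 20:
        return False
    msg_type, _length, cookie = struct.unpack("!HHI", packet[:8])
    return (
        msg_type == BINDING_SUCCESS
        and cookie == MAGIC_COOKIE
        and packet[8:20] == transaction_id
    )


def stun_answers(host, port=3478, timeout=3.0):
    """Send one Binding request over UDP and report whether a valid reply arrived."""
    request, tid = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(request, (host, port))
            reply, _addr = sock.recvfrom(2048)
        except OSError:
            return False
    return is_binding_success(reply, tid)
```

A single lost datagram produces a false negative, so retry `stun_answers("77.42.120.136")` a few times before treating a failure as a firewall problem.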
- [ ] **Step 6: Reboot-resilience — the real test (console available)**
```bash
ssh sjat@100.99.226.39 'sudo systemctl reboot'
# wait ~60s, then from ubongo — no manual intervention:
sleep 60; ssh sjat@100.99.226.39 'sudo nft list chain inet filter input | grep -E "policy drop|wt0|91.226.145.80"'
curl -sI https://netbird.askari.wingu.me | head -1
ssh sjat@100.99.226.39 'docker ps --format "{{.Names}} {{.Status}}"'
```
Expected, unattended: input `policy drop` with the `wt0` + `91.226.145.80` allows; public cert valid; both containers Up; `wt0` SSH back. (If lost: recover via the Hetzner console — the firewall auto-rollback and the WAN break-glass should make that unnecessary.)
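A fixed `sleep 60` is a guess about boot time; polling until the host answers gives a clearer pass or fail. This is a minimal sketch of a deadline-bounded poll. `wait_until` is a hypothetical helper, and the 180 s deadline and 5 s interval are illustrative assumptions, not values from the spec.

```python
import time


def wait_until(probe, deadline_s=180.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns True or `deadline_s` elapses; return the outcome."""
    end = clock() + deadline_s
    while True:
        if probe():
            return True
        if clock() >= end:
            # Deadline passed without a successful probe: report failure to the caller.
            return False
        sleep(interval_s)
```

The probe can be any zero-argument callable, for example one that runs `ssh sjat@100.99.226.39 true` through `subprocess.run` and returns whether the exit code was 0. Injecting `sleep` and `clock` keeps the loop testable without real waiting.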
- [ ] **Step 7: Record reality in the ground-truth docs and commit**
Update `STATUS.md` (the askari row): firewall now **applied** — INPUT-only default-deny, SSH `wt0`-primary + permanent WAN break-glass (ubongo's WAN), managed over `wt0`, geolocation disabled, **reboot-validated**. Update `docs/ROADMAP.md` "Next step": mark the askari SSH→`wt0` redesign **DONE**; the next mesh-hardening sub-project is the **SPOF reduction** (askari relay single-point-of-failure) — confirmed by the `ubongo → askari` `Relayed` finding (2026-06-19).
```bash
rbw unlocked && make lint
git add STATUS.md docs/ROADMAP.md
git commit -m "docs(status): mesh-hardening redesign — askari INPUT-only + WAN break-glass applied + reboot-validated" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
## Notes / out of scope (carry to the SPOF sub-project)
- **SPOF reduction is the next sub-project** (operator decision 2026-06-19): `ubongo → askari` is currently `Relayed` through askari's own relay; if askari is down, relayed peers lose the mesh data plane. Its own spec.
- **NetBird ACL stays Allow-All** — any enrolled peer can reach askari `wt0:22` until a later sub-project.
- **Full forward-chain hardening** (`docker_host` container-forward drop-in over the `input_only` baseline) — a later tightening; the existing `askari` integration profile already covers that path.
- **Coordinator off-site backup** (FRICTION 2026-06-17 #5, ADR-022) — still pending; not in scope.


@ -0,0 +1,470 @@
# Mesh-hardening 2/3 — ubongo INPUT-only default-deny — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Apply base's nftables firewall to the control node (ubongo) as an INPUT-only default-deny — hardening its inbound surface — while leaving the forward chain permissive so Docker egress and the libvirt-NAT integration harness keep working, and without any sshd `ListenAddress` change.
**Architecture:** Two new `base` knobs make the existing firewall concern fit a control node: `base__firewall_input_only` flips the forward chain to `policy accept` (host-local input filtering only), and `base__firewall_admin_addrs` adds operator-workstation LAN sources to the SSH allow-list (alongside `wt0` and `ssh-from-control`). sshd is untouched (nftables does the scoping → no `ip_nonlocal_bind` boot-race). The change is validated on a throwaway VM via the ADR-025 integration harness (a new "be ubongo" profile) before an operator-supervised live cutover whose safety net is the firewall auto-rollback timer plus the permanent on-prem physical console.
**Tech Stack:** Ansible (role `base`, FQCN), nftables, Jinja2, Molecule on Debian 13, pytest (none new), the ADR-025 integration harness (`scripts/integration-vm.py`, JSON profiles, `-e @` overlays).
**Spec:** `docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md`
**Conventions:** `make lint` and `make test ROLE=base` before each commit; `make check` before `make deploy`; never hand-edit the generated `offsite.yml`; `rbw unlocked` for any commit touching Ansible content and for the integration/live applies (the production `group_vars/all/vault.yml` is in inventory scope and gets decrypted at playbook load). Tasks 1-3 are code (subagent-driven, each lint/Molecule-verified). Task 4 is a real-VM validation gate on ubongo. Task 5 is the live, operator-supervised cutover.
---
## File Structure
| File | Create/Modify | Responsibility |
|---|---|---|
| `roles/base/defaults/main.yml` | Modify | Declare `base__firewall_input_only` + `base__firewall_admin_addrs` (defaults: off / empty). |
| `roles/base/templates/nftables.conf.j2` | Modify | Conditional forward policy; render an SSH-allow rule per admin address. |
| `roles/base/molecule/default/converge.yml` | Modify | Fixture: an admin-addr source (input-only stays at its default → forward drop). |
| `roles/base/molecule/default/verify.yml` | Modify | Assert forward-drop default + the admin-addr rule render. |
| `inventories/production/group_vars/control/vars.yml` | Modify | Turn the knobs on for ubongo (input-only; mamba's LAN IP). |
| `tests/integration/overrides/ubongo.yml` | Create | The "be ubongo" overlay (input-only firewall; harness SSH lifeline). |
| `tests/integration/profiles/ubongo.json` | Create | The "be ubongo" VM profile (group `control`, applies `site.yml:base`). |
| `tests/integration/overrides/askari.yml` | Modify | Add the `integration_profile` marker (verify is now profile-aware). |
| `tests/integration/verify.yml` | Modify | Gate the askari (Docker/DNAT) block; add the ubongo (input-only) block + a guard. |
| `STATUS.md`, `docs/ROADMAP.md` | Modify (Task 5) | Record mesh-hardening 2/3 done. |
---
### Task 1: base role — `base__firewall_input_only` (forward policy) + `base__firewall_admin_addrs` (LAN SSH allow)
**Files:**
- Modify: `roles/base/defaults/main.yml`
- Modify: `roles/base/templates/nftables.conf.j2`
- Modify: `roles/base/molecule/default/converge.yml`
- Modify: `roles/base/molecule/default/verify.yml`
> **Test strategy (note):** Molecule renders one fixture, so it locks the *secure default*
> (`input_only` **off** → forward `policy drop`) plus the new admin-addr rule (red→green). The
> `input_only` **on** → forward `policy accept` path is exercised on a real VM by the
> integration "be ubongo" profile (Tasks 3-4), whose verify fails red until this template
> conditional exists. Both branches are covered, across the two test layers.
- [ ] **Step 1: Write the failing test (extend Molecule verify)**
In `roles/base/molecule/default/verify.yml`, after the `Assert the docker_host extension hook is present` block, add:
```yaml
    - name: Assert the forward chain defaults to policy drop (input_only off)
      ansible.builtin.assert:
        that:
          - "'hook forward priority 0; policy drop;' in nft"
        fail_msg: >-
          forward chain must default to policy drop when base__firewall_input_only is
          false (container isolation stays the norm on real service hosts)
    - name: Assert the admin-addr SSH allow rule (operator workstation on the LAN)
      ansible.builtin.assert:
        that:
          - "'ip saddr 10.30.0.77 tcp dport 22 accept' in nft"
        fail_msg: "missing admin-addr SSH allow rule from base__firewall_admin_addrs"
```
- [ ] **Step 2: Add the fixture that drives it (Molecule converge)**
In `roles/base/molecule/default/converge.yml`, add to the `vars:` block (after the `base__firewall_control_addr` line):
```yaml
base__firewall_admin_addrs:
- "10.30.0.77" # fixture: an operator-workstation LAN source (admin-addr SSH allow)
```
- [ ] **Step 3: Run the test to verify it fails**
Run: `make test ROLE=base`
Expected: FAIL on `Assert the admin-addr SSH allow rule` (the template does not consume `base__firewall_admin_addrs` yet, so the `ip saddr 10.30.0.77 …` rule is absent). The forward-drop assertion passes already (the template currently hardcodes `policy drop`).
- [ ] **Step 4: Add the defaults**
In `roles/base/defaults/main.yml`, after the `base__firewall_apply: true` line (end of the firewall behaviour block, currently line 13), add:
```yaml
base__firewall_input_only: false # true → the forward chain is `policy accept` (host-local
# INPUT filtering only). For hosts that forward/route
# container or NAT traffic (the control node's Docker +
# libvirt-NAT) where a forward default-deny would break
# them. Real service hosts keep this false (forward drop).
base__firewall_admin_addrs: [] # extra LAN source IPs allowed to SSH, besides wt0 +
# ssh-from-control. For an operator workstation reaching
# the host over the LAN (no mesh). Key-gated. (ADR-021)
```
- [ ] **Step 5: Make the forward policy conditional + render the admin-addr rules**
In `roles/base/templates/nftables.conf.j2`:
(a) Replace the forward-chain line (currently line 21):
```jinja
chain forward { type filter hook forward priority 0; policy {{ 'accept' if base__firewall_input_only | bool else 'drop' }}; }
```
(b) After the `ssh-from-control` `{% endif %}` (currently line 14) and before the `ip protocol icmp accept` line, add the admin-addr loop:
```jinja
{% for addr in base__firewall_admin_addrs %}
ip saddr {{ addr }} tcp dport {{ base__firewall_ssh_port }} accept
{% endfor %}
```
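To make the two template branches concrete, here is a minimal Python sketch that mirrors the conditional forward policy and the admin-addr loop. It does not use Jinja2 or the real template: `render_firewall_lines` and its parameter names are hypothetical, and the default SSH port of 22 is an assumption taken from the rule text that verify asserts.

```python
def render_firewall_lines(input_only, admin_addrs, ssh_port=22):
    """Mirror the two template changes: admin-addr SSH rules + forward policy.

    Hypothetical helper for illustration only; the role renders these lines
    from nftables.conf.j2, not from Python.
    """
    # One accept rule per admin address, in list order (the Jinja for-loop).
    rules = [f"ip saddr {addr} tcp dport {ssh_port} accept" for addr in admin_addrs]
    # input_only=True flips the forward chain from default-deny to accept.
    policy = "accept" if input_only else "drop"
    forward = f"chain forward {{ type filter hook forward priority 0; policy {policy}; }}"
    return rules, forward


if __name__ == "__main__":
    # Secure default (what Molecule locks) vs. the control-node setting.
    print(render_firewall_lines(False, ["10.30.0.77"]))
    print(render_firewall_lines(True, ["192.168.150.98", "192.168.150.99"]))
```

An empty address list yields no SSH rules at all, which is why the default `[]` leaves the existing allow-list unchanged.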
- [ ] **Step 6: Run the test to verify it passes**
Run: `make test ROLE=base`
Expected: PASS — converge renders the ruleset; verify confirms the forward chain is `policy drop` (input_only defaults false) and the `ip saddr 10.30.0.77 tcp dport 22 accept` rule is present; all pre-existing assertions stay green.
- [ ] **Step 7: Lint**
Run: `make lint`
Expected: `Passed: 0 failure(s)` and `check-tags: OK`.
- [ ] **Step 8: Commit**
```bash
git add roles/base/defaults/main.yml roles/base/templates/nftables.conf.j2 \
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml
git commit -m "feat(base): input-only forward policy + admin-addr SSH allow

base__firewall_input_only renders the forward chain policy accept (host-local
INPUT filtering only) for hosts that forward container/NAT traffic; defaults
false so real service hosts keep the forward default-deny. base__firewall_admin_addrs
adds operator-workstation LAN sources to the SSH allow-list alongside wt0 +
ssh-from-control. Molecule locks the secure default + the admin rule.
Mesh-hardening 2/3 (ADR-020/021).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 2: inventory — enable input-only default-deny + mamba on ubongo (control group)
**Files:**
- Modify: `inventories/production/group_vars/control/vars.yml`
- [ ] **Step 1: Turn the knobs on for the control group**
Append to `inventories/production/group_vars/control/vars.yml`:
```yaml
# Mesh-hardening 2/3 (2026-06-19, ADR-020/021): apply base's host firewall to ubongo as
# INPUT-only default-deny — harden the inbound surface, leave the forward chain permissive so
# Docker egress + the libvirt-NAT integration harness keep working. sshd is unchanged
# (nftables scopes inbound), so there is no boot-race. Reach ubongo over wt0 (mesh), the
# ssh-from-control self-path (base__firewall_control_addr, group_vars/all = 10.20.10.151), or
# mamba on the LAN. Break-glass: the physical console. (base__firewall_apply defaults true.)
base__firewall_input_only: true
base__firewall_admin_addrs:
- "10.20.10.50" # mamba over the LAN (NetBird off). Raw DHCP lease — revisit with an
# OPNsense reservation when OPNsense-as-code lands; backstopped by wt0.
- "10.20.10.17" # 2nd operator workstation (MAC bc:0f:f3:c8:4a:8a). Raw lease — ditto.
```
- [ ] **Step 2: Verify the vars resolve for ubongo**
Run: `.venv/bin/ansible-inventory -i inventories/production/ --host ubongo 2>/dev/null | grep -E 'firewall_input_only|firewall_admin_addrs|10.20.10.(50|17)'`
Expected: shows `"base__firewall_input_only": true` and `"base__firewall_admin_addrs": ["10.20.10.50", "10.20.10.17"]`.
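Because `ansible-inventory --host` prints JSON, the same check can be done structurally rather than by grepping lines. The following is a small sketch under that assumption; `check_control_firewall_vars` is a hypothetical helper, not part of the repo, and it expects the JSON text on its input.

```python
import json


def check_control_firewall_vars(inventory_json, expected_addrs):
    """Return a list of problems found in `ansible-inventory --host` JSON output.

    Hypothetical helper: an empty list means both knobs resolved as planned.
    """
    hostvars = json.loads(inventory_json)
    problems = []
    # The knob must resolve to a real boolean true, not a missing or falsy value.
    if hostvars.get("base__firewall_input_only") is not True:
        problems.append("base__firewall_input_only is not true")
    # The allow-list must match exactly, including order.
    if hostvars.get("base__firewall_admin_addrs") != list(expected_addrs):
        problems.append("base__firewall_admin_addrs does not match")
    return problems


if __name__ == "__main__":
    sample = json.dumps({
        "base__firewall_input_only": True,
        "base__firewall_admin_addrs": ["10.20.10.50", "10.20.10.17"],
    })
    print(check_control_firewall_vars(sample, ["10.20.10.50", "10.20.10.17"]))
```

A structured comparison does not depend on how the JSON list is wrapped across lines, which a line-oriented `grep` does.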
- [ ] **Step 3: Lint**
Run: `make lint`
Expected: clean pass (`check-tags: OK`).
- [ ] **Step 4: Commit**
```bash
git add inventories/production/group_vars/control/vars.yml
git commit -m "feat(inventory): ubongo gets INPUT-only host firewall + mamba LAN SSH

Enables base__firewall_input_only on the control group (forward chain stays
permissive so Docker egress + the integration-test libvirt NAT survive) and
allows the operator workstations' LAN IPs (mamba 10.20.10.50 + 10.20.10.17;
raw leases, backstopped by wt0). Mesh-hardening 2/3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 3: integration harness — "be ubongo" profile (overlay + profile + profile-aware verify)
**Files:**
- Create: `tests/integration/overrides/ubongo.yml`
- Create: `tests/integration/profiles/ubongo.json`
- Modify: `tests/integration/overrides/askari.yml`
- Modify: `tests/integration/verify.yml`
- [ ] **Step 1: Create the "be ubongo" overlay**
Create `tests/integration/overrides/ubongo.yml`:
```yaml
---
# Integration-test overlay for the "ubongo" profile (ADR-025). Passed via `-e @`.
# Exercises mesh-hardening 2/3: base's INPUT-only default-deny on the control node — input
# chain default-deny, forward chain left permissive (Docker/libvirt-NAT safe), no sshd
# ListenAddress change (so no boot-race).
integration_profile: ubongo
base__firewall_apply: true
base__firewall_input_only: true # forward chain renders `policy accept`
base__firewall_admin_addrs:
- "192.168.150.98" # two representative LAN sources — exercises the
- "192.168.150.99" # admin-addr loop with a multi-entry list (like ubongo)
# Never wt0-only; never touch the real mesh from a throwaway VM.
base__ssh_listen_mesh_only: false
base__mesh_enabled: false
# Allow SSH from the libvirt-NAT gateway (where the driver/ansible connect from) so the
# default-deny apply + the reboot don't lock out the harness. By source IP (interface-
# independent). This is the harness's lifeline; the admin-addr above is only exercised.
base__firewall_control_addr: "192.168.150.1"
```
- [ ] **Step 2: Create the "be ubongo" VM profile**
Create `tests/integration/profiles/ubongo.json`:
```json
{
"groups": ["control"],
"applies": [
{"playbook": "site.yml", "tags": ["base"]}
],
"extra_vars_files": ["overrides/ubongo.yml"],
"mem_mib": 2048,
"vcpus": 2
}
```
- [ ] **Step 3: Mark the askari overlay with its profile name**
In `tests/integration/overrides/askari.yml`, after the two header comment lines (before `base__firewall_apply: true`), add:
```yaml
integration_profile: askari
```
- [ ] **Step 4: Make `verify.yml` profile-aware (the test)**
Replace the entire contents of `tests/integration/verify.yml` with:
```yaml
---
# Integration verify (ADR-025). Outcome-based, profile-aware: the active profile is named by
# `integration_profile` (set in each profile's overlay). Each profile asserts its own success
# criteria; an unknown/unset profile fails loudly (never a silent pass).
- name: Verify the rebooted host
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: A known integration_profile must be set (no silent pass)
      ansible.builtin.assert:
        that:
          - integration_profile is defined
          - integration_profile in ['askari', 'ubongo']
        fail_msg: "integration_profile must be set in the profile overlay (askari|ubongo)"
    # ── askari profile: Docker host, published-port forwarding survives the reboot ──
    # The load-bearing check probes the VM's published :80 FROM the controller (ubongo): if
    # base's forward-drop killed DNAT, this times out (the FRICTION 2026-06-17 #1 bug).
    - name: (askari) Gather service facts
      when: integration_profile == 'askari'
      ansible.builtin.service_facts:
    - name: (askari) Docker daemon is active
      when: integration_profile == 'askari'
      ansible.builtin.assert:
        that: "ansible_facts.services['docker.service'].state == 'running'"
        fail_msg: "docker.service is not running"
    - name: (askari) Forward chain permits container traffic (drop-in loaded)
      when: integration_profile == 'askari'
      ansible.builtin.command: nft list chain inet filter forward
      register: _fwd
      changed_when: false
    - name: (askari) Assert container forwarding is allowed (not pure drop)
      when: integration_profile == 'askari'
      ansible.builtin.assert:
        that: "'accept' in _fwd.stdout"
        fail_msg: >-
          forward chain is pure drop; container forwarding will die on reboot
          (FRICTION 2026-06-17 #1). docker_host container-forward drop-in missing.
    - name: (askari) Published port answers from the controller (DNAT + forward alive)
      when: integration_profile == 'askari'
      delegate_to: localhost
      become: false
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}/"
        follow_redirects: none
        status_code: [200, 301, 308, 404, 502, 503]
        timeout: 10
      register: _probe
      retries: 5
      delay: 6
      until: _probe is succeeded
    # ── ubongo profile: control node, INPUT-only default-deny survives the reboot ──
    # SSH reachability across the reboot is proven by the harness itself (it re-SSHes and
    # checks boot_id changed before this verify runs). Here we assert the ruleset shape.
    - name: (ubongo) Read the live nftables ruleset
      when: integration_profile == 'ubongo'
      ansible.builtin.command: nft list ruleset
      register: _nft
      changed_when: false
    - name: (ubongo) INPUT default-deny, forward permissive, admin-addr allow
      when: integration_profile == 'ubongo'
      ansible.builtin.assert:
        that:
          - "'hook input priority 0; policy drop;' in _nft.stdout"
          - "'hook forward priority 0; policy accept;' in _nft.stdout"
          - "'ip saddr 192.168.150.98 tcp dport 22 accept' in _nft.stdout"
          - "'ip saddr 192.168.150.99 tcp dport 22 accept' in _nft.stdout"
        fail_msg: >-
          ubongo profile: expected input policy drop, forward policy accept (input-only),
          and both admin-addr (192.168.150.98/99) SSH allows in the live ruleset.
```
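The profile gate and the ubongo ruleset assertion are plain substring checks, so their behaviour can be sketched outside Ansible. The Python below is an illustrative model only; `require_known_profile` and `ubongo_ruleset_problems` are hypothetical names, and the sample ruleset text is abbreviated rather than real `nft list ruleset` output.

```python
KNOWN_PROFILES = ("askari", "ubongo")


def require_known_profile(profile):
    """Fail loudly on an unknown or unset profile (no silent pass)."""
    if profile not in KNOWN_PROFILES:
        raise ValueError(f"integration_profile must be one of {KNOWN_PROFILES}")
    return profile


def ubongo_ruleset_problems(nft_ruleset, admin_addrs, ssh_port=22):
    """Return the expected substrings that are missing from the ruleset text."""
    expected = [
        "hook input priority 0; policy drop;",      # INPUT default-deny
        "hook forward priority 0; policy accept;",  # forward left permissive
    ]
    expected += [f"ip saddr {a} tcp dport {ssh_port} accept" for a in admin_addrs]
    return [needle for needle in expected if needle not in nft_ruleset]


if __name__ == "__main__":
    sample = (
        "type filter hook input priority 0; policy drop;\n"
        "ip saddr 192.168.150.98 tcp dport 22 accept\n"
        "type filter hook forward priority 0; policy accept;\n"
    )
    require_known_profile("ubongo")
    # Lists whatever is missing; here the second admin address.
    print(ubongo_ruleset_problems(sample, ["192.168.150.98", "192.168.150.99"]))
```

Returning the missing substrings, rather than a bare boolean, is a design choice for the sketch: it shows which assertion would have failed.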
- [ ] **Step 5: Validate the JSON + lint**
Run: `.venv/bin/python -m json.tool tests/integration/profiles/ubongo.json >/dev/null && echo OK` then `make lint`
Expected: `OK`, then a clean lint pass (`check-tags: OK`).
- [ ] **Step 6: Commit**
```bash
git add tests/integration/overrides/ubongo.yml tests/integration/profiles/ubongo.json \
tests/integration/overrides/askari.yml tests/integration/verify.yml
git commit -m "test(integration): add the 'be ubongo' profile (input-only default-deny)

A control-group VM that applies base with INPUT-only default-deny (forward
policy accept; admin-addr SSH allow). verify.yml is now profile-aware via an
integration_profile marker: the askari Docker/DNAT block is gated, and a ubongo
block asserts input drop + forward accept + the admin-addr rule. Enables
\`make test-integration HOST=ubongo\`. Mesh-hardening 2/3 (ADR-025).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 4: Validate on the integration harness (`make test-integration HOST=ubongo`) — the GREEN gate
> Runs a throwaway UEFI VM on ubongo: boots it, applies the base role with the ubongo
> overlay (INPUT-only default-deny), **reboots it**, and asserts the ruleset + SSH-returns.
> This proves the change survives a reboot before the real control node is ever touched
> (spec §cutover step 1; FRICTION signal-6). No code change / no commit — a validation gate.
- [ ] **Step 1: Ensure the vault is unlocked**
The run loads `inventories/production/group_vars/all/vault.yml` (symlinked into the run dir), which is decrypted at playbook load.
Run: `rbw unlocked || rbw unlock`
Expected: exits 0 (unlocked). If it prompts, the operator unlocks.
- [ ] **Step 2: Run the integration cycle**
Run: `make test-integration HOST=ubongo`
Expected (the `cycle`: up → apply → reboot → assert): the VM gets a `192.168.150.x` lease; `site.yml --tags base` applies cleanly; `… rebooted (boot_id changed), SSH back at 192.168.150.x`; then `VERIFY PASSED for boma-it-ubongo-…`. The VM is destroyed on success.
- [ ] **Step 3: On failure, read the diagnostics**
If it prints `VERIFY FAILED`, diagnostics are in `~/integration-runs/boma-it-ubongo-<id>/` (`nft.txt`, `console.log`, `journal.txt`). The likely suspects: the admin-addr/forward assertion (Task 1/3 wiring) or SSH not returning post-reboot (the `base__firewall_control_addr: 192.168.150.1` lifeline in the overlay). Fix the implicated task, re-commit, and re-run Step 2. If a VM was left defined, run `make test-integration-clean` first.
- [ ] **Step 4: Record the result**
Capture the `VERIFY PASSED` line in the task notes (this is the gate Task 5 step 1 depends on). No commit.
---
### Task 5: Live staged cutover (operator-supervised — NOT a subagent task)
> Touches the **real ubongo** (the control node Ansible runs from) and reboots it — lockout-
> risky. Run it interactively with the operator, in order, verifying each step before the
> next. The firewall auto-rollback timer (`base__firewall_rollback_timeout`, 45 s) +
> `wait_for_connection` over the live path is the safety net; the **on-prem physical console**
> is the permanent break-glass. Do NOT hand this to an unattended agent.
- [ ] **Step 1: Pre-checks (gate: Task 4 GREEN)**
- `rbw unlocked || rbw unlock`.
- SSH to ubongo over `wt0` from a road-warrior succeeds.
- SSH to ubongo from mamba on the LAN (`10.20.10.50`) succeeds.
- `.venv/bin/ansible ubongo -i inventories/production/ -m ping` → `SUCCESS` (over `10.20.10.151`).
- The physical console is reachable. If any path fails, STOP.
- [ ] **Step 2: Dry-run the firewall apply**
Run: `make check PLAYBOOK=site LIMIT=ubongo TAGS=firewall`
Expected: the nftables diff shows `policy drop` on input, `iifname "wt0" … accept`, `ip saddr 10.20.10.151 … accept`, `ip saddr 10.20.10.50 … accept`, `ip saddr 10.20.10.17 … accept`, and the forward chain as `policy accept`. No errors.
- [ ] **Step 3: Apply the host firewall (auto-rollback armed)**
Run: `make deploy PLAYBOOK=site LIMIT=ubongo TAGS=firewall`
Expected: the firewall concern snapshots `/etc/nftables.rollback`, arms the 45 s `systemd-run` revert, applies the ruleset, `reset_connection` → `wait_for_connection` over `10.20.10.151` succeeds, then cancels the timer. If connectivity is lost, the timer reverts the ruleset within 45 s and the console is the fallback.
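The safety net in this step can be read as a short decision sequence. The sketch below is a simplified, synchronous model of that sequence, not the role's implementation: the three callables are hypothetical stand-ins, and the real revert is a `systemd-run` timer that fires on its own unless it is cancelled, which this model only approximates.

```python
def guarded_apply(apply_ruleset, still_reachable, restore_snapshot):
    """Model of apply-with-auto-rollback.

    apply_ruleset:    loads the new nftables ruleset (hypothetical callable)
    still_reachable:  returns True if the management path answers afterwards
    restore_snapshot: what the armed timer would do if nobody cancels it
    """
    apply_ruleset()
    if still_reachable():
        # Connection survived: the pending revert is cancelled.
        return "kept"
    # Connection lost: the timer expires and the old ruleset comes back.
    restore_snapshot()
    return "reverted"


if __name__ == "__main__":
    log = []
    outcome = guarded_apply(
        lambda: log.append("apply"),
        lambda: False,  # simulate a lockout
        lambda: log.append("revert"),
    )
    print(outcome, log)
```

The point of the ordering is that the revert is prepared before the risky change, so a lockout needs no operator action to undo.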
- [ ] **Step 4: Verify every path + forwarding still works**
```bash
# from a road-warrior over wt0, and from mamba on the LAN:
ssh sjat@100.99.146.14 true && echo "wt0 OK"
ssh sjat@10.20.10.151 true && echo "mamba-LAN OK" # run from mamba (10.20.10.50)
# Ansible self-path:
.venv/bin/ansible ubongo -i inventories/production/ -m ping
# a disallowed LAN host (any source not in base__firewall_admin_addrs; 10.20.10.50 and 10.20.10.17 are allowed) must now be refused/timeout on :22
# Docker egress (forward chain still permissive):
docker run --rm busybox wget -qO- https://cloudflare.com/cdn-cgi/trace | head -1
# libvirt-NAT forwarding intact — a fresh integration VM still reaches apt:
make test-integration HOST=ubongo # expect VERIFY PASSED (proves the NAT path survived)
```
Expected: `wt0 OK`, `mamba-LAN OK`, Ansible `SUCCESS`, the disallowed host refused, the Docker egress line returns, and the integration cycle passes.
- [ ] **Step 5: Reboot resilience — while the console is present (FRICTION signal-6)**
With the operator at the physical console, reboot ubongo (`sudo systemctl reboot`). After it returns, confirm SSH comes back on all paths **unaided**:
```bash
ssh sjat@100.99.146.14 true && echo "wt0 OK after reboot"
.venv/bin/ansible ubongo -i inventories/production/ -m ping
```
Expected: SSH returns with no manual intervention (no `ListenAddress`, so nothing to race). Only now is the cutover complete.
- [ ] **Step 6: Update STATUS + ROADMAP**
- In `STATUS.md`: in the `roles/base/` row of "Scaffolded but empty", change the firewall note: the `firewall` concern is now **applied to ubongo** as INPUT-only default-deny (it is no longer "not yet applied to any host"); note the `base__firewall_input_only` knob and that the forward default-deny still awaits the `docker_host` drop-in for real service hosts. In the ubongo control-node row, mark the "Pending" default-deny item as done.
- In `docs/ROADMAP.md`: mark **mesh-hardening sub-project 2 (ubongo default-deny) done**; the remaining follow-ons are sub-project 1 (askari SSH→`wt0` *redesign*) and sub-project 3 (NetBird ACL). Update the "Next step" section accordingly.
```bash
git add STATUS.md docs/ROADMAP.md
git commit -m "docs: ubongo INPUT-only default-deny applied (mesh-hardening 2/3 done)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
- [ ] **Step 7: Push**
Run: `git push origin main`
---
## Self-review (against the spec)
- **§ Design: INPUT-only default-deny** → Task 1 (forward-policy knob) + Task 2 (enabled on ubongo). ✓
- **§ Design: admin-addrs (operator workstations on LAN)** → Task 1 (`base__firewall_admin_addrs` + template loop) + Task 2 (`10.20.10.50` mamba, `10.20.10.17`). ✓
- **§ Design: no sshd ListenAddress change** → nothing touches `ssh.yml`/`sshd_hardening.conf.j2`; only nftables. ✓ (verified: Tasks 1-3 file lists exclude them).
- **§ allow-list** (lo, established, wt0, ssh-from-control, admin-addr, icmp; forward accept) → template already renders lo/established/wt0/control/icmp; Task 1 adds admin-addr + forward-accept. ✓
- **§ Why-safe (incident signals 1/2/3/6)** → signal 1 (forward accept, Task 1); signal 2 (no ListenAddress); signal 3 (ubongo keeps LAN + console); signal 6 (Task 4 harness reboot + Task 5 step 5 reboot-while-console). ✓
- **§ New & changed code** (defaults, template, molecule, group_vars/control, integration profile) → Tasks 1-3. ✓
- **§ admin raw-leases + revisit** → Task 2 comments record both leases + the OPNsense-reservation revisit trigger; backstop (wt0) noted; flagged in `FRICTION.md`. ✓
- **§ Testing** (Molecule render asserts; `make test-integration HOST=ubongo`; live checks) → Task 1 (Molecule), Task 4 (harness), Task 5 step 4 (live). ✓ Coverage split (default in Molecule, input_only on the VM) noted in Task 1.
- **§ Staged cutover (signal-6 order)** → Task 5 steps 1-7; reboot-recovery (step 5) precedes nothing that retires a break-glass (the console is permanent). ✓
- **§ Risks/rollback** → auto-rollback (Task 5 step 3), redundant paths + physical console, raw-lease backstop. ✓
- **Type/name consistency:** `base__firewall_input_only` (bool) and `base__firewall_admin_addrs` (list) are spelled identically in defaults, template, converge, group_vars, and the overlay. `integration_profile` is spelled identically in both overlays and the three gates in `verify.yml`. ✓
- **Placeholder scan:** no TBD/TODO; every code/command step shows the actual content. ✓

@ -0,0 +1,237 @@
# Mesh SPOF — accept + targeted resilience — Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Accept askari's single-coordinator SPOF as a documented availability trade-off, and harden the one real gap — a `base` mesh knob that pins the coordinator FQDN in `/etc/hosts` on managed mesh hosts so a local-DNS hiccup can't strand the mesh.
**Architecture:** One additive, idempotent `base` `mesh`-concern task (a `/etc/hosts` line via `lineinfile`, gated on a new opt-in knob), Molecule-tested; plus documentation (accepted-risk R8 + an ADR-016 availability amendment + STATUS/ROADMAP). No new infra, no Terraform, no live-deploy gate.
**Tech Stack:** Ansible (`base` role, `lineinfile`), Molecule (Debian 13), Markdown docs.
**Spec:** `docs/superpowers/specs/2026-06-20-mesh-spof-accept-resilience-design.md`
## Global Constraints
- **FQCN always** (`ansible.builtin.*`); role defaults use the `rolename__var` namespace.
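- **No new collection**: derive the coordinator FQDN with builtin `regex_replace`. (`urlsplit` is also an `ansible.builtin` filter in ansible-core, so it would not pull in `community.general` either; this plan uses `regex_replace` throughout.)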
- The pin is **opt-in and additive**: gated on `base__mesh_enabled | bool` AND `base__mesh_coordinator_pin | length > 0`. Empty knob (the default) = a clean no-op. The coordinator host (`askari`/`offsite_hosts`) is **exempt** — leave its pin empty.
- **askari's coordinator IP = `77.42.120.136`** (stable WAN; the A record for `netbird.askari.wingu.me`); ubongo is in the `control` group.
- `make lint` clean + `rbw unlocked` before any commit (the pre-commit hook decrypts the vault).
- **No new infra** — no P2P, no second relay/coordinator, no Terraform. The coordinator off-site backup is **out of scope** (ADR-022 kickoff).
- Tags: the new task carries the `mesh` concern tag (it belongs to the mesh concern).
---
### Task 1: `base` mesh coordinator-FQDN `/etc/hosts` pin (DNS-resilience)
Add an opt-in knob that pins the coordinator FQDN (derived from `base__mesh_management_url`) to a stable IP in `/etc/hosts`, so a managed mesh host survives a local-DNS failure. TDD'd through the role's Molecule scenario (which already exercises the `mesh` concern with `manage: false`).
**Files:**
- Modify: `roles/base/defaults/main.yml` (add the knob after the mesh block, ~line 53)
- Modify: `roles/base/tasks/mesh.yml` (append the pin task)
- Modify: `roles/base/molecule/default/converge.yml` (add a fixture pin to the vars block)
- Modify: `roles/base/molecule/default/verify.yml` (assert the rendered `/etc/hosts` line)
- Modify: `inventories/production/group_vars/control/vars.yml` (set the pin for ubongo)
**Interfaces:**
- Produces: role default `base__mesh_coordinator_pin` (string, default `""`); when set + `base__mesh_enabled`, an `/etc/hosts` line `<pin-ip> <fqdn>` where `<fqdn>` is `base__mesh_management_url` minus scheme/port/path.
- [ ] **Step 1: Write the failing Molecule test (fixture + assertion)**
In `roles/base/molecule/default/converge.yml`, add one line to the `vars:` block (after `base__mesh_setup_key`, ~line 15):
```yaml
base__mesh_coordinator_pin: "203.0.113.9" # fixture coordinator IP (TEST-NET-3); pins the FQDN from base__mesh_management_url
```
In `roles/base/molecule/default/verify.yml`, append to the `tasks:` list (after the mesh no-op assertion at the end):
```yaml
    - name: Read /etc/hosts (coordinator pin)
      ansible.builtin.slurp:
        src: /etc/hosts
      register: _etchosts
    - name: Assert the coordinator FQDN is pinned to the fixture IP (DNS-resilience / R8)
      ansible.builtin.assert:
        that:
          - "'203.0.113.9 netbird.askari.wingu.me' in (_etchosts.content | b64decode)"
        fail_msg: "base__mesh_coordinator_pin did not render the /etc/hosts coordinator pin"
        success_msg: "coordinator FQDN pinned in /etc/hosts"
```
- [ ] **Step 2: Run Molecule to verify it fails**
Run: `make test ROLE=base`
Expected: FAIL at "Assert the coordinator FQDN is pinned…" — no pin task exists yet, so `/etc/hosts` has no such line.
- [ ] **Step 3: Add the default knob**
In `roles/base/defaults/main.yml`, after `base__mesh_version` (~line 53), add:
```yaml
# DNS-resilience (ADR-016 availability / accepted-risk R8): when set to the coordinator's
# stable IP, pin the coordinator FQDN (derived from base__mesh_management_url) in /etc/hosts
# so a managed mesh host survives a local-DNS hiccup (the 2026-06-18 incident class). Empty
# = no pin. The coordinator host itself (askari/offsite_hosts) is exempt — leave it empty.
base__mesh_coordinator_pin: ""
```
- [ ] **Step 4: Add the pin task**
Append to `roles/base/tasks/mesh.yml`:
```yaml
- name: Pin the NetBird coordinator FQDN in /etc/hosts (DNS-resilience, ADR-016 availability / R8)
  ansible.builtin.lineinfile:
    path: /etc/hosts
    regexp: '\s{{ _coordinator_fqdn | regex_escape }}$'
    line: "{{ base__mesh_coordinator_pin }} {{ _coordinator_fqdn }}"
    state: present
  vars:
    _coordinator_fqdn: "{{ base__mesh_management_url | regex_replace('^https?://', '') | regex_replace('[:/].*', '') }}"
  when:
    - base__mesh_enabled | bool
    - base__mesh_coordinator_pin | length > 0
  tags: [mesh]
```
(`_coordinator_fqdn` strips the scheme, then everything from the first `:` or `/`, leaving `netbird.askari.wingu.me`. The `regexp` matches an existing ` <fqdn>` at line end, so a changed IP updates in place (idempotent); absent → appended.)
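The FQDN derivation and the replace-or-append behaviour can be checked in isolation. The Python sketch below mirrors the two `regex_replace` calls and approximates `lineinfile` with a `regexp` (replace the last matching line, else append). The function names are hypothetical, and the example URL is an assumption, since the actual value of `base__mesh_management_url` is not shown here.

```python
import re


def coordinator_fqdn(management_url):
    """Strip the scheme, then everything from the first ':' or '/'."""
    host = re.sub(r"^https?://", "", management_url)
    return re.sub(r"[:/].*", "", host)


def pin_hosts(lines, pin_ip, fqdn):
    """Approximate lineinfile: replace the last matching line, else append."""
    pattern = re.compile(r"\s" + re.escape(fqdn) + r"$")
    new_line = f"{pin_ip} {fqdn}"
    matches = [i for i, line in enumerate(lines) if pattern.search(line)]
    out = list(lines)
    if matches:
        out[matches[-1]] = new_line  # changed IP updates in place
    else:
        out.append(new_line)         # absent: appended
    return out


if __name__ == "__main__":
    # Assumed URL shape; only the host part matters for the pin.
    fqdn = coordinator_fqdn("https://netbird.askari.wingu.me:443")
    hosts = pin_hosts(["127.0.0.1 localhost"], "203.0.113.9", fqdn)
    print(hosts)
    # A second run with the same inputs changes nothing (idempotence).
    print(pin_hosts(hosts, "203.0.113.9", fqdn) == hosts)
```

One limitation worth noting: stripping from the first `:` would mangle a bracketed IPv6 literal in the URL, so the approach assumes a hostname-based management URL.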
- [ ] **Step 5: Run Molecule to verify it passes**
Run: `make test ROLE=base`
Expected: PASS — the new assertion is green and Molecule idempotence is clean (re-running the pin task reports `ok`, not `changed`). The idempotence pass is what proves the `regexp` matches the line it wrote.
> Note: the empty-knob no-op (the production default for non-mesh / coordinator hosts) is guaranteed by the `when: base__mesh_coordinator_pin | length > 0` gate, not a separate Molecule case — a single converge can't hold both var-states, and boma uses one default scenario per role. The fixture exercises the meaningful path (rendering + FQDN extraction + idempotence).
- [ ] **Step 6: Wire the production pin for ubongo**
In `inventories/production/group_vars/control/vars.yml`, after the `base__mesh_enabled: true` block, add:
```yaml
# DNS-resilience (ADR-016 availability / R8): pin the coordinator FQDN to askari's stable WAN
# IP in /etc/hosts so a local-DNS hiccup (the 2026-06-18 incident class) can't strand ubongo's
# mesh. askari (offsite_hosts) is exempt — it reaches the coordinator locally.
base__mesh_coordinator_pin: "77.42.120.136"
```
- [ ] **Step 7: Lint and commit**
```bash
rbw unlocked && make lint
git add roles/base/defaults/main.yml roles/base/tasks/mesh.yml \
roles/base/molecule/default/converge.yml roles/base/molecule/default/verify.yml \
inventories/production/group_vars/control/vars.yml
git commit -m "feat(base): pin the NetBird coordinator FQDN in /etc/hosts (mesh DNS-resilience)" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
### Task 2: Accept + document the SPOF (R8, ADR-016 amendment, STATUS/ROADMAP)
Record the single-coordinator SPOF as a conscious, revisitable trade-off and capture the availability analysis + recovery. Pure documentation; references the pin from Task 1.
**Files:**
- Modify: `docs/security/accepted-risks.md` (add row R8; bump the review date)
- Modify: `docs/decisions/016-mesh-vpn.md` (add the availability amendment subsection)
- Modify: `STATUS.md` (note the SPOF accepted + the coordinator-pin knob)
- Modify: `docs/ROADMAP.md` (mark sub-project 3 addressed; surface ADR-022 backup + ACL as next)
- [ ] **Step 1: Add accepted-risk R8**
In `docs/security/accepted-risks.md`, add this row to the table after R7:
```markdown
| R8 | **Single off-site mesh coordinator is an availability SPOF for remote mesh access**: `askari` hosts the only NetBird management/signal/relay (ADR-016); while askari is down, every *relayed* peer (all of `ubongo`'s, by the deliberate default-deny posture) loses remote mesh reachability and the control plane pauses. The `netbird_coordinator` store also has **no off-site backup yet** (BACKUP.md), so an askari loss loses mesh control-plane state until rebuilt | Inherent to ADR-016's deliberate single off-site coordinator (sovereignty; survives a homelab outage). **Narrow blast radius:** the mesh is not a gateway (`wt0` routes only `100.99.0.0/16`), so LAN, intra-cluster, and local-service traffic are unaffected; only remote/off-LAN mesh access breaks, and only when off-LAN *and* askari is down at once. askari is a reliable always-on VPS; mitigations: client + managed-host coordinator-FQDN DNS pin (`base__mesh_coordinator_pin`; runbook), documented `/setup` rebuild | askari proves unreliable; the cluster grows to depend on the mesh for intra-node traffic; remote mesh access becomes business-critical; or the ADR-022 backup role lands (closes the state-loss half) |
```
Then update the closing line's date: change `_Last reviewed: 2026-06-18.` to `_Last reviewed: 2026-06-20.`
- [ ] **Step 2: Add the ADR-016 availability amendment**
In `docs/decisions/016-mesh-vpn.md`, add this subsection immediately before the `## Related` section:
```markdown
## Availability — an `askari` outage (amendment 2026-06-20)
The coordinator is deliberately **single** (one off-site host). Recorded here so its
availability envelope is explicit; accepted as **R8** (`docs/security/accepted-risks.md`).
The mesh is **not** a default gateway — `wt0` routes only the overlay CIDR (`100.99.0.0/16`);
normal traffic uses the host's default route. So an `askari` outage has a **narrow blast
radius**:
| Traffic | `askari` down |
|---|---|
| LAN device → LAN service (direct / via reverse proxy) | unaffected |
| node ↔ node over LAN IPs (cluster) | unaffected |
| node ↔ node same-LAN over mesh IPs | unaffected (direct P2P) |
| **road-warrior → `ubongo` (remote, relayed)** | **breaks** |
| mesh control plane (new enrol / ACL change / re-handshake) | pauses |
Only remote (off-LAN) mesh access to peers is lost, and only when off-LAN **and** `askari`
is down simultaneously. On-LAN access to `ubongo` never depends on the mesh (Recovery &
operations, above).
**Recovery:** rebuild the coordinator (`/setup` + re-enrol peers, M5) or restore from backup
once ADR-022 lands; the `netbird_coordinator` store backup is the **next sub-project** (its
gap is named in R8 and `BACKUP.md`). Client/road-warrior break-glass (reliable resolvers +
the coordinator-FQDN `/etc/hosts` pin) is in `docs/runbooks/netbird-client.md`; managed mesh
hosts get the same pin via `base__mesh_coordinator_pin`.
**Not pursued** (deliberately, given the narrow blast radius): direct P2P (punctures the
default-deny posture; only helps established sessions), a second relay (needs another public
host / reintroduces the home public surface), a second coordinator (unsupported by
self-hosted NetBird; against this ADR).
```
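The blast-radius table in the amendment follows from one routing rule: only destinations inside the overlay CIDR leave via `wt0`, and only relayed paths depend on `askari`. A minimal Python sketch of that rule, for illustration only; the function names and any sample addresses are assumptions, not part of the repo:

```python
import ipaddress

# Overlay CIDR routed via wt0, as stated in the ADR-016 amendment.
MESH_CIDR = ipaddress.ip_network("100.99.0.0/16")


def uses_mesh(dst: str) -> bool:
    """True if traffic to dst would leave via the mesh interface."""
    return ipaddress.ip_address(dst) in MESH_CIDR


def breaks_when_coordinator_down(dst: str, relayed: bool) -> bool:
    """Model of the table: only relayed mesh paths depend on askari."""
    return uses_mesh(dst) and relayed
```

The sketch covers the data-plane rows only; the control-plane row (new enrolments, ACL changes, re-handshakes pausing) is not modelled.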
- [ ] **Step 3: Update STATUS.md**
In `STATUS.md`, in the `roles/base/` row, append a sentence noting the pin and the accepted SPOF to the end of the firewall/mesh description (before the closing ` |`):
```markdown
The `mesh` concern also pins the coordinator FQDN in `/etc/hosts` (`base__mesh_coordinator_pin`, set for ubongo) so a local-DNS hiccup can't strand the mesh; the single-coordinator SPOF is an accepted availability risk (R8, ADR-016 availability amendment).
```
- [ ] **Step 4: Update ROADMAP.md**
In `docs/ROADMAP.md`, in the "Remaining mesh-hardening sub-projects" list, change item 3 from the SPOF-reduction "(next)" wording to **DONE**, and make the NetBird ACL the next item. Replace the current items 3–4 block with:
```markdown
3. ~~**askari relay-SPOF reduction**~~ **DONE (2026-06-20)**: assessed + **accepted** as a
documented availability risk (R8 + ADR-016 availability amendment): the blast radius is
narrow (LAN/intra-cluster/local traffic never touch askari), so no P2P / second relay /
second coordinator was warranted. Hardened the one real gap — a managed-host coordinator-FQDN
DNS pin (`base__mesh_coordinator_pin`). The coordinator off-site backup gap is handed to ADR-022.
4. **NetBird ACL off Allow-All** to scoped policies (open mechanism question — no headless API path).
5. **ADR-022 backup kickoff** — off-site backup of the `netbird_coordinator` store (named in R8 /
BACKUP.md) as the first slice of the backup role (restic + the `fisi` pull node).
```
- [ ] **Step 5: Consistency check + commit**
```bash
grep -q "^| R8 " docs/security/accepted-risks.md && \
grep -q "Availability — an .askari. outage" docs/decisions/016-mesh-vpn.md && \
echo "docs OK"
```
Expected: `docs OK`.
```bash
rbw unlocked
git add docs/security/accepted-risks.md docs/decisions/016-mesh-vpn.md STATUS.md docs/ROADMAP.md
git commit -m "docs(security): accept the single-coordinator mesh SPOF (R8) + ADR-016 availability amendment" \
-m "Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>"
```
---
## Notes / out of scope
- **Coordinator off-site backup → ADR-022 kickoff** (next sub-project). Not built here.
- **Direct P2P / second relay / second coordinator** — deliberately not pursued (spec §Design).
- No live deploy is required to land this — the pin is additive/idempotent and applies to ubongo on the next routine `base` apply (`make deploy PLAYBOOK=site LIMIT=ubongo`, operator's discretion). Optional post-deploy spot-check: `getent hosts netbird.askari.wingu.me` on ubongo resolves to `77.42.120.136`.
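What "additive/idempotent" means for the pin can be illustrated with a small Python model. This is a sketch under stated assumptions, not the role's implementation (the role is Ansible and its tasks are not shown in this section); `pin_host` is a hypothetical helper, and the anchored pattern is one plausible reading of the "anchored pin regexp" fix:

```python
import re


def pin_host(hosts_text: str, ip: str, fqdn: str) -> str:
    """Return hosts_text with exactly one '<ip> <fqdn>' line for fqdn."""
    # Anchored at both ends: the line must be "<address> <fqdn>" and nothing
    # else, so comments and longer names ending in fqdn are left alone.
    pattern = re.compile(r"^[0-9A-Fa-f:.]+[ \t]+" + re.escape(fqdn) + r"[ \t]*$")
    wanted = f"{ip} {fqdn}"
    out, pinned = [], False
    for line in hosts_text.splitlines():
        if pattern.match(line):
            if not pinned:
                out.append(wanted)  # replace the first match in place
                pinned = True
            # later duplicate pins are dropped
        else:
            out.append(line)
    if not pinned:
        out.append(wanted)  # additive: append when no pin exists yet
    return "\n".join(out) + "\n"
```

Applying the function twice yields the same text as applying it once, which is the property that makes a routine `base` apply safe. Lines that carry extra aliases after the FQDN are deliberately not matched by this sketch.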

@ -0,0 +1,212 @@
# Design — Logging and log integrity (ship all logs to Loki)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** TODO 3.1 ("Decide how to manage logs"); makes concrete ADR-002's
"logs shipped to a central location" + "active alerting" controls; advances TODO 3.6
- **Becomes:** ADR-018 (this design is the basis for that ADR)
---
## Problem
boma wants **all logs in one queryable store** for three things: day-to-day
troubleshooting, spotting issues/trends over time, and **detecting intrusions /
malicious activity**. ADR-002 already commits in principle ("`auditd`… Logs shipped to
a central location if a log aggregation service is available"; "Active alerting wires
AIDE/`auditd`/`fail2ban`/Suricata into the monitoring/alerting stack… ties to the
Loki/Grafana effort"), and CAPABILITIES lists Loki (planned) + `askari` as the off-site
watchdog. What's undecided is the **architecture** and, critically, the **integrity**
dimension: an attacker who roots a host will try to clear logs to cover their tracks.
The key insight that frames the integrity question: **the biggest anti-tampering win is
that logs leave the host in near-real-time.** Once a line is in a store the attacker
doesn't control, wiping the local copy is futile. The remaining question is only *how
far* to harden the central store — set by the threat model.
## Decisions (the settled forks)
1. **Threat model — opportunistic + blast-radius**, per ADR-002 / accepted-risk R1.
Not forensic-grade. This sizes everything below.
2. **Ship all logs to an on-cluster Loki** — the single monitoring DB for
troubleshooting + trends. Near-real-time shipping already defeats per-host
track-covering.
3. **Split: a security-relevant subset ALSO ships off-site to `askari`, write-only.**
Tamper-resistant against full-cluster compromise, at bounded volume.
4. **Skip WORM/object-lock (Tier 3)** — recorded as accepted-risk R4; append-only push
+ off-site is the proportionate control.
5. **Disk-wear is a managed design parameter, not a blocker** — storage media choice +
bounded verbosity + tuned retention + wearout monitoring (Section: Retention & wear).
## Architecture & components
**Agent — Grafana Alloy on every host, installed by the `base` role.** Alloy reads
journald + container logs + the security sources (`auditd`, `authpriv`, `fail2ban`,
AIDE) on every host (docker_hosts, proxmox nodes, `ubongo`, `askari`) and ships them.
Placing it in `base` ties it to ADR-002's baseline "logs shipped to central" control.
**Two Loki instances, one Grafana:**
```
┌──────────────────── per host (base role) ─────────────────────┐
│ Grafana Alloy: collect journald + container + auditd/auth/... │
└──────────┬───────────────────────────────────┬────────────────┘
  ALL logs │                   security subset │ (over the NetBird mesh)
           ▼                                   ▼
┌────────────────────────┐       ┌──────────────────────────────┐
│ Loki (cluster) all logs│       │ Loki (askari) security only  │
│ docker_host, NVMe,     │       │ off-site, write-only push,   │
│ bounded hot retention  │       │ long retention, append-only  │
└───────────┬────────────┘       └──────────────┬───────────────┘
            └───────────────┬────────────────────┘
            ┌────────────────────────────────────┐
            │ Grafana (cluster): both datasources │
            │ dashboards + alerts (AIDE/auditd/   │
            │ fail2ban/Suricata + log-silence)    │
            └────────────────────────────────────┘
```
- **Loki (cluster):** `loki` service role on a docker_host; **all** logs; monolithic
single-binary mode (ample at this scale); NVMe; bounded retention.
- **Loki (`askari`):** the same role parameterised, deployed to the `offsite_hosts`
group; **security subset only**, **write-only**, long retention, tiny volume.
- **Grafana:** `grafana` service role on the cluster; both Lokis as datasources (one
pane queries both); where ADR-002's "active alerting" lands.
Reuses what boma already has: `askari` (off-site, on the mesh per ADR-016) and the
`base`/service-role machinery.
## Data flow & the security subset
Each host's Alloy pipeline writes **everything** to the cluster Loki and a **filtered
copy** of security events to the `askari` Loki — a relabel/match stage tags security
sources (`security="true"`) and routes only those to the second `loki.write` target.
One agent, two destinations.
**Security subset** (high-value, bounded volume): `auditd` (auth, privilege, file
watches), `authpriv` (SSH, `sudo`), `fail2ban` (bans), AIDE (file-integrity reports),
**Suricata** (OPNsense isn't a `base` host, so it **syslog-forwards** alerts to the
ingest point), and key container security events (reverse-proxy 401/403, Authentik
login events, Docker daemon events).
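The "one agent, two destinations" routing can be modelled in a few lines of Python. This is a conceptual sketch only, not Alloy configuration: the `source` label key, the lower-case source names, and the destination names are illustrative assumptions.

```python
# Sources the design lists as the security subset; the label key "source"
# and the destination names are illustrative assumptions.
SECURITY_SOURCES = {"auditd", "authpriv", "fail2ban", "aide", "suricata"}


def route(labels: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Return (labels after tagging, destinations) for one log line."""
    tagged = dict(labels)  # never mutate the caller's labels
    destinations = ["cluster-loki"]  # everything goes to the cluster store
    if tagged.get("source") in SECURITY_SOURCES:
        tagged["security"] = "true"
        destinations.append("askari-loki")  # filtered copy, off-site
    return tagged, destinations
```

The container-level security events (reverse-proxy 401/403, Authentik logins, Docker daemon events) would need a content match rather than a source match, which this sketch leaves out.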
**Write-only / append-only** (the tamper-resistance mechanism):
- The `askari` Loki push endpoint (`/loki/api/v1/push`) is reachable only over the
**NetBird mesh**, with a **push-only credential**; hosts hold *only* that.
- Loki's query/admin/delete APIs on `askari` are **not exposed to hosts** (localhost /
mesh-ACL'd to operator + Grafana). The push API has no edit/delete verb, so a
compromised host can **append but not read/edit/delete**. Deletion needs the
admin/compactor API or filesystem — unreachable from a host.
- The cluster Loki uses the same push-only credential, blocking per-host log-clearing
via API there too.
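The write-only property amounts to an allowlist with a single entry. A minimal Python sketch of that check follows; where it would be enforced (reverse proxy, mesh ACL, or both) is an assumption, and `host_request_allowed` is a hypothetical name:

```python
PUSH_PATH = "/loki/api/v1/push"  # the only endpoint hosts may reach


def host_request_allowed(method: str, path: str) -> bool:
    """Allowlist for requests carrying a host's push-only credential."""
    return method.upper() == "POST" and path == PUSH_PATH
```

Anything else a compromised host might try with its credential, such as a query or a delete request, falls through to deny.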
**Reliability:** Alloy buffers (WAL) and retries, so a brief `askari`/mesh outage
doesn't lose logs: they flush on reconnect, and only a small local buffer is needed.
## Security, integrity & residual risks
**Defeated:** opportunistic track-covering (`rm`/`vacuum`) — lines are already off the
host; **host pivot to the store** — an attacker rooting any cluster host can append but
not delete, and cannot reach `askari`'s admin plane. **The security trail survives full
cluster compromise.**
**Honest residual risks (conscious, recorded):**
1. **Append-only ≠ cryptographic WORM** — a root-on-`askari` attacker could edit chunk
files on disk. Skipping object-lock is **accepted-risk R4**; mitigated by `askari`
being minimal/hardened/operator-only/mesh-only.
2. **Un-shipped window** — a few seconds of not-yet-flushed logs live on the host;
near-real-time minimises it. Accept.
3. **Agent compromise (forward-looking)** — rooting a host lets the attacker stop *that
host's* Alloy or inject *future* false logs, but **cannot alter shipped history**.
4. **Detection as a feature** — a host that **goes silent** (Alloy stops) is an
**alert**; the tamper attempt becomes a signal. "Log-source silence" is wired into
Grafana alerting.
5. **Credential theft / `askari` outage** — a stolen push credential allows appending
noise, not deletion (bounded, rotatable); an `askari` outage buffers on hosts and
flushes on reconnect (a very long outage eventually drops oldest — monitor it).
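The "log-source silence" signal in item 4 reduces to comparing each host's newest log timestamp against a threshold. A minimal Python sketch, assuming the last-seen timestamps are already available; in practice this would be a Grafana alert rule, and the threshold value is a tuning choice the design does not fix:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def silent_hosts(last_seen: dict[str, datetime], threshold: timedelta,
                 now: Optional[datetime] = None) -> list[str]:
    """Hosts whose newest log line is older than threshold."""
    now = now or datetime.now(timezone.utc)
    return sorted(host for host, ts in last_seen.items() if now - ts > threshold)
```

A host missing from `last_seen` entirely is not reported by this sketch, so a real rule would also compare against the expected inventory.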
**ADR-002 fit:** realises "logs shipped to central" + "active alerting"; the off-site +
append-only model is a clean blast-radius-containment enhancement for the opportunistic
threat model.
## Retention, sizing & disk-wear
**Sizing (estimates; intent-based until measured, like `/capacity-review`):** a 2–5
host homelab generates ~1–3 GB/day raw "typical" (≪1 GB/day quiet; 5–15 GB/day very
chatty); Loki compresses ~7–10× → ~0.1–0.4 GB/day stored; the security subset is
~10–20% of that.
**Retention (tunable in `group_vars`):**
- **Cluster Loki (all logs):** bounded hot retention, start **30–90 days** (~10–35 GB
at 90d on NVMe).
- **`askari` Loki (security subset):** **1 year+** (~5–25 GB/yr), small enough to keep
the security trail long for over-time detection.
- Defaults now; **re-measure real volume after a few weeks live** and tune.
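The sizing and retention figures can be reproduced with simple arithmetic, which also makes re-tuning easy once real volume is measured. A short Python sketch, taking the estimate's ranges as roughly 1 to 3 GB/day raw, 7 to 10× compression, and a 10 to 20% security subset; the function names are illustrative:

```python
def stored_gb_per_day(raw_gb_per_day: float, compression_ratio: float) -> float:
    """Compressed volume Loki actually writes per day."""
    return raw_gb_per_day / compression_ratio


def footprint_gb(stored_per_day: float, retention_days: int) -> float:
    """Steady-state store size once retention is full."""
    return stored_per_day * retention_days


low = stored_gb_per_day(1, 10)   # quiet end of "typical"
high = stored_gb_per_day(3, 7)   # busy end of "typical"
cluster_90d = (footprint_gb(0.1, 90), footprint_gb(0.4, 90))
askari_1y = (footprint_gb(0.1 * 0.10, 365), footprint_gb(0.4 * 0.20, 365))
```

With those inputs the cluster store lands at roughly 9 to 36 GB for 90 days and the security subset at roughly 4 to 29 GB per year, in line with the rounded figures in the list.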
**Disk-wear (the lore is real only for specific media/misconfig; mitigated as design
rules):** at boma's volume even ~10–40 GB/day of amplified writes is decades of life on
a ~600-TBW/TB NVMe. Rules:
1. Log storage on **NVMe/SSD** (or **HDD** for a long-retention cold tier — sequential,
endurance-unlimited); **never SD/USB flash**.
2. **Bounded verbosity at source** (sane log levels, selective access logging, a
*targeted* `auditd` ruleset) — the one lever that controls wear *and* firehose size.
3. Tuned Loki **retention + compaction** so neither store grows unbounded.
4. **SSD wearout/TBW is a monitored metric** (Proxmox wearout %, `node_exporter`
smartmon) with an alert — wear is a graph, not a surprise. (Depends on the metrics
stack — see Dependencies.)
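The "decades of life" claim is a one-line calculation worth keeping next to the rules. A Python sketch, assuming a constant daily write volume and a 1 TB drive rated at about 600 TBW; `endurance_years` is an illustrative name:

```python
def endurance_years(tbw_terabytes: float, writes_gb_per_day: float) -> float:
    """Years until the rated TBW is reached at a constant daily write volume."""
    total_gb = tbw_terabytes * 1000  # decimal TB -> GB, as drive vendors rate
    return total_gb / (writes_gb_per_day * 365)
```

At 40 GB/day of amplified writes this gives about 41 years, and at 10 GB/day about 164 years, so the drive's rated endurance is not the limiting factor at this volume.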
Capacity bookkeeping ties into ADR-012: a log-storage allocation line (cluster +
`askari`) and SSD-wearout as a tracked metric.
## Documentation & implementation changes
This is a substantial capability → its own ADR-018, with reconciliations:
| Doc / artifact | Change |
|---|---|
| ADR-018 (new) | Home of record: ship-all-to-Loki, the off-site write-only security subset, append-only model, skip-WORM (R4), disk-wear rules. |
| `base` role (when built) | Install + configure Alloy (all → cluster Loki; subset → `askari` write-only). |
| `loki` service role (new, when built) | One role, two deployments (cluster all-logs; `askari` security-subset write-only). `SECURITY.md` + `VERIFY.md`. |
| `grafana` service role (new, when built) | Both Lokis as datasources; dashboards + alerting (AIDE/`auditd`/`fail2ban`/Suricata + log-silence). |
| OPNsense (Ansible-managed) | Syslog-forward Suricata alerts to the ingest point. |
| ADR-002 | "Logs shipped to central" + "active alerting" bullets point to ADR-018. |
| `docs/security/accepted-risks.md` | Add **R4** — no cryptographic WORM for logs (append-only + off-site is the control). |
| `docs/CAPABILITIES.md` §3 | Loki → decided; add the off-site security sink + Alloy agent rows; mark the alerting wiring. |
| `docs/decisions/012-hardware-capacity.md` | Log-storage allocation (cluster + `askari`) + SSD-wearout tracked metric. |
| `STATUS.md` + `docs/TODO.md` (3.1 / 3.6) | Mark "how to manage logs" decided by ADR-018; rows as designed-not-built. |
| `vault.yml` | Push-only Loki credential (`vault.loki.*`). |
**Buildable now:** ADR-018 + the ADR-002/CAPABILITIES/ADR-012/accepted-risks/STATUS/TODO
reconciliations. **Deferred on the stack:** the Alloy-in-`base`, `loki`/`grafana`
service roles, OPNsense syslog config, and the live pipeline.
## Dependencies
- `base` role + service-role machinery (unbuilt) — STATUS.md.
- The running cluster + `askari` (`offsite_hosts`, designed) — ADR-016.
- OPNsense automation (for Suricata syslog forwarding) — ADR-007.
- The **metrics stack** (Prometheus / `node_exporter`) for SSD-wearout + log-silence
alerting — sibling effort, TODO 3.6.
## Deferred / out of scope
1. **WORM / object-lock (Tier 3)** — accepted-risk R4; revisit only if the threat model
shifts to targeted/forensic.
2. **The metrics pipeline** (Prometheus/`node_exporter`) — sibling effort; this spec is
**logs**. SSD-wearout + silence alerting depend on it.
3. **Cold archival beyond Loki retention** (export to backups) and **structured/parsed
per-service log standards** — future refinements.
## What was ruled out
| Option | Reason |
|---|---|
| Everything off-site on `askari` (no on-cluster Loki) | The firehose (tens to hundreds of GB/yr) is disk-hungry on a small VPS; keep volume where storage is cheap (on-cluster) and send only the bounded security subset off-site. |
| WORM / object-lock for all logs | Forensic-grade cost for an opportunistic threat model — YAGNI (R4). |
| On-cluster-only logging (no off-site copy) | Doesn't survive compromise of the cluster Loki host; the security trail needs to be off-cluster + append-only. |
| Volatile (RAM-only) journald to cut writes | Risks losing logs on crash before shipping; persistent-with-size-caps + real-time shipping is safer. |
| Promtail / legacy agents | Alloy is the current unified Grafana collector and the V4-aligned choice; one agent for logs (and later metrics). |
See also: ADR-002 (security baseline — realised here), ADR-016 (mesh / `askari`),
ADR-007 (OPNsense / `askari`), ADR-012 (hardware/capacity), ADR-004 (service-role
standard), ADR-011 (health checks — distinct from this).

@ -0,0 +1,206 @@
# Design — Mesh VPN (NetBird, self-hosted on `askari`)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** ADR-015 deferred item #1 (mesh VPN choice) and the `accepted-risks.md`
R3 "pending VPN choice" placeholder
- **Amends:** ADR-007 (retires the VLAN-99 OPNsense WireGuard design)
- **Becomes:** ADR-016 (this design is the basis for that ADR)
---
## Problem
`ubongo` (ADR-015) needs remote SSH access from anywhere (work PC, laptop, phone)
without exposing anything to the public internet. ADR-015 left the access mechanism —
the "mesh VPN" — deferred to this discussion.
Meanwhile ADR-007 already commits to **WireGuard-via-OPNsense** for the `vpn` VLAN
(VLAN 99, `10.99.0.0/24`): `askari` (the off-site Hetzner monitoring VPS) peers to
OPNsense, plus road-warrior clients. And `docs/CAPABILITIES.md` already flags the open
question: *"ADR-007 commits WireGuard-via-OPNsense; Netbird (mesh) is a real
alternative to weigh."*
So the real decision is three-cornered (plain OPNsense WireGuard vs NetBird vs
Tailscale), with an architectural sub-question of whether a mesh replaces or coexists
with the ADR-007 WireGuard.
## Decisions (as settled)
1. **Scope — the mesh *replaces* WireGuard.** A single overlay becomes the sole
remote-access path for `ubongo`, `askari`, and road-warrior clients. ADR-007's
VLAN-99 OPNsense WireGuard design is retired.
2. **Control plane — self-hosted, on `askari`.** Maximum sovereignty (boma already
self-hosts Vaultwarden, Forgejo, its own DNS), no third-party trust, and an off-site
coordinator that survives a homelab outage and stays out of the cluster it
administers.
3. **Tool — NetBird.** Self-hosting on `askari` selects NetBird: it is designed to be
self-hosted as a first-class, fully open-source stack. (Tailscale's self-host path
means Headscale, a separate third-party reimplementation with partial parity — ruled
out below.)
4. **Routing — NetBird agent on every (Linux) host**, not a subnet router. At boma's
scale (2–5 hosts, treated as individuals) the usual "agent everywhere" downside is
moot, and the `base` role already runs on every host, so enrollment is one uniform
role task. Avoids a routing single-point-of-failure and gives granular per-peer ACLs
that match ADR-007's firewall intent. **One exception:** OPNsense (FreeBSD) is not a
first-class NetBird agent target, so `mgmt`/gateway reachability is handled by a
single advertised route or by administering OPNsense from an on-LAN meshed peer.
5. **Identity — embedded local users** (Dex, built into the management container), not
a standalone Zitadel/Keycloak. YAGNI for a single operator; external SSO remains a
documented future option.
## Verified facts (ADR-014)
> verified: NetBird self-hosting architecture · NetBird docs · docs.netbird.io/selfhosted · 2026-06-05
> - Components: management + signal + dashboard + relay/TURN (Coturn). Since **v0.65**
> the core services are **merged into a single container**; deploy via Docker Compose.
> - Identity: since **v0.62**, built-in **local users** with an **embedded IdP (Dex)**;
> external OIDC IdPs (Zitadel, Keycloak, Authentik, Okta, …) are **optional**, not
> required.
> - Ports (behind reverse proxy): **TCP 80/443** + **UDP 3478** (STUN/TURN).
> - Host: a Linux VM + Docker Compose + a domain name; lightweight.
>
> verified: NetBird licensing · GitHub netbirdio/netbird · 2026-06-05
> - Dual license: **AGPLv3** for `management/`, `signal/`, `relay/`; **BSD-3-Clause**
> elsewhere. Fully open source, self-hostable, no open-core feature gating.
---
## Architecture & topology
A single NetBird mesh is the sole remote-access overlay, replacing ADR-007's VLAN-99
WireGuard. Data plane is peer-to-peer WireGuard; control plane is self-hosted NetBird
on `askari`.
**`askari`'s dual role.** `askari` (Hetzner, off-site, always-up, independent of the
cluster per ADR-007) runs the **NetBird management stack** (single container:
management + signal + dashboard + Coturn, behind a reverse proxy on TCP 80/443 + UDP
3478) **and** is itself a mesh peer. Off-site hosting is what makes the mesh survive a
full homelab outage and keeps the coordinator out of the cluster it administers (no
chicken-and-egg).
**Peers:**
- `askari` — coordinator + peer.
- `ubongo` (control/AI-worker host) — agent.
- All Linux managed hosts (`dns1/2`, `proxy`, …) — agent via the `base` role.
- Road-warrior clients — `mamba`, phone, work PC — agent/app.
- OPNsense / `mgmt` — the single non-agent exception (advertised route or LAN-side
admin from a meshed peer).
**Retired:** ADR-007's VLAN-99 WireGuard endpoint on OPNsense and the
`10.99.0.0/24` peer scheme. `askari` reaches `srv`/`mgmt` over the mesh under NetBird
ACLs instead of OPNsense routing `10.99.0.0/24`.
---
## Security model, ACLs, and attack surface
**ACL policy mirrors ADR-007's firewall intent** (NetBird is default-deny):
- `vpn` peers → `srv` **metrics ports only** (askari's monitoring scope).
- admin peers (`ubongo`, `mamba`) → `srv` + `mgmt` for administration.
- road-warrior clients → only what each needs; nothing by default.
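The default-deny shape of that policy can be written down as an explicit allow set. A minimal Python sketch for illustration; the group and service names mirror the bullets, while the tuple layout and the `any` wildcard are assumptions and not NetBird's policy format:

```python
# (source group, destination group, service) triples that are permitted.
ALLOW = {
    ("vpn", "srv", "metrics"),   # askari's monitoring scope
    ("admin", "srv", "any"),     # ubongo, mamba
    ("admin", "mgmt", "any"),
}


def allowed(src: str, dst: str, service: str) -> bool:
    """Default-deny: permit only what an explicit rule names."""
    return (src, dst, service) in ALLOW or (src, dst, "any") in ALLOW
```

Road-warrior clients have no entry here, so they reach nothing until a rule is added for a specific need.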
**Enrollment via setup keys.** Hosts join non-interactively using NetBird **setup
keys**, stored in `vault.yml` as `vault.netbird.setup_key` and consumed by the `base`
role. Prefer ephemeral/scoped keys (ADR-002).
**Host firewall interaction.** NetBird creates a `wt0` mesh interface. The `base`
role's nftables default-deny allows inbound admin (SSH) **only on `wt0`**, denied on
the physical NIC — the pattern ADR-015 set for `ubongo`, now applied fleet-wide. Mesh
+ nftables are defence-in-depth.
**The new attack surface — a public control plane on `askari`.** Today `askari`
exposes a WireGuard UDP port; with NetBird self-hosted it exposes the **management API
+ dashboard (80/443)** and **Coturn (3478)** publicly, and the management API is
keys-to-the-kingdom for the whole mesh. Mitigations baked in:
- Dashboard/API behind TLS + the embedded IdP login; source-IP restrictions where
practical.
- `askari` runs `base` hardening (already a public managed host) and NetBird is
**version-pinned** (ADR-011) and patched on boma's cadence — self-hosting means
owning the CVE cadence (AGPLv3 server).
Net vs ADR-002: nothing from the **cluster** is publicly exposed; the only public
surface is on `askari` (a public VPS by design), shifting from "WireGuard port" to
"NetBird control plane."
---
## Recovery, bootstrap ordering, and operations
**Ansible's control path stays off the mesh.** `ubongo` is on the LAN and reaches the
fleet by **LAN IP** (ADR-009). The mesh only provides *external* reach to
`ubongo`/the fleet, so a mesh/coordinator outage never blocks on-LAN Ansible runs and
there is no chicken-and-egg in the critical path.
**Bootstrap order** (askari-first):
1. Stand up the NetBird coordinator on `askari`.
2. Enroll `ubongo`.
3. `base` role enrolls the rest of the fleet via setup keys from vault.
**Recovery.** Coordinator off-site on `askari` ⇒ the mesh survives a full homelab
outage. Two must-haves:
- **Back up NetBird's management datastore** off `askari` — encrypted, synced to
`ubongo`/`mamba`. If `askari` dies, restore the coordinator; peers re-enroll.
- Existing peer tunnels keep running on last-known config through a brief coordinator
outage; only changes/new enrollments need it live — so `askari` is important but not
instantly fatal.
**`askari` becomes Ansible-managed.** It joins the inventory under its own group and
gets the `base` role plus a dedicated **`netbird_coordinator` service role** (one
service = one role per ADR-004, with its own `SECURITY.md` per the service-role
standard). Agent install/enrollment lives in `base`.
**DNS & versions.** boma's `dns` role stays authoritative for `boma.baobab.band`;
NetBird's built-in DNS is scoped/off to avoid overlap. NetBird server (on `askari`)
and agents (via `base`) are version-pinned (ADR-011).
---
## Documentation & implementation changes
This is a substantial decision → its own ADR, with amendments linking to it.
| Doc | Change |
|---|---|
| ADR-016 (new) | Home of record for this design. |
| ADR-007 (network) | Replace the VLAN-99 WireGuard section + `10.99.0.0/24` scheme with the NetBird mesh; update the firewall-intent table and the `askari` external-monitoring section to ride the mesh. |
| ADR-015 (control host) | Resolve deferred item #1: mesh VPN = NetBird self-hosted on `askari`; update the access/recovery notes. |
| `docs/security/accepted-risks.md` | Replace R3 ("pending VPN choice") with the concrete residual risk: self-hosted coordinator = no third-party trust, but a public NetBird control plane on `askari` to harden + patch. |
| `docs/CAPABILITIES.md` | Resolve the VPN row (line ~29): decided — NetBird mesh, self-hosted on `askari`. |
| `STATUS.md` | Add rows (designed, not built): NetBird coordinator on `askari`; NetBird agent enrollment in `base`. |
| `base` role (when built) | Install + enroll the NetBird agent; nftables allows SSH only on `wt0`. |
| `netbird_coordinator` service role (new, when built) | Deploys the NetBird stack on `askari`; populated `SECURITY.md`; molecule scenario. |
| `requirements.yml` | Only if a task needs a new collection module (ADR dependencies policy). |
**Scope note:** like the `ubongo` work, most *implementation* here waits on the `base`
and service-role machinery that STATUS.md lists as not-yet-built. This spec settles the
decision and the doc reconciliation; the role tasks land when `base` is built.
---
## Deferred / out of scope
1. **External SSO IdP** (Zitadel/Keycloak) — embedded local users now; SSO later if a
second operator or service-SSO need appears.
2. **OPNsense mesh integration specifics** — the exact `mgmt` reachability mechanism
(single advertised route vs LAN-side admin) is settled during implementation when
OPNsense automation is built.
3. **The `base` / `netbird_coordinator` role implementation** — depends on the
unbuilt `base` role and service-role standard.
---
## What was ruled out
| Option | Reason |
|---|---|
| Plain OPNsense WireGuard (ADR-007 as-is) | No identity/ACL layer, manual peer config, OPNsense-centric; the operator wants a mesh with policy-based access and easy multi-device enrollment. |
| Tailscale (hosted coordinator) | Adds a third-party trust dependency for the control plane; against boma's self-hosting ethos. (Hosted coordinator's recovery benefit is matched by putting a self-hosted coordinator off-site on `askari`.) |
| Tailscale + Headscale (self-hosted) | Headscale is a third-party reimplementation of Tailscale's control server with partial feature parity and no official vendor support — weaker than NetBird's first-class self-hosting. |
| Mesh coordinator on the cluster | Recreates the chicken-and-egg ADR-015 escapes, and dies with the homelab. `askari` (off-site) instead. |
| Subnet router via `ubongo` | Makes `ubongo` a routing SPOF; `askari` would go blind to `srv` when `ubongo` is down even if services are healthy. Agent-per-host instead. |
| Standalone IdP (Zitadel/Keycloak) now | Heavy for a single operator; embedded local users (Dex) suffice. External SSO stays a future option. |
See also: ADR-007 (network), ADR-015 (control host), ADR-002 (security), ADR-011
(version pinning), ADR-004 (one service = one role), ADR-009 (TF↔Ansible handoff),
ADR-013 (heritage — V4 used WireGuard; NetBird is translated, not transplanted).

@ -0,0 +1,203 @@
# Design — Service-UI acceptance verification (ADR-008 Level 4)
- **Date:** 2026-06-05
- **Status:** Approved design — pending implementation plan
- **Resolves:** ADR-015 deferred item #2 (browser-E2E verification harness); TODO 2.2
(browser portion) + TODO 2.3 (test users + manual-test instruction)
- **Expands:** ADR-008 Level 4 (currently a stub)
- **Becomes:** ADR-017 (this design is the basis for that ADR)
---
## Problem
ADR-008 defines testing Levels 1–3 (Molecule, staging deploy, external smoke) and a
**Level 4 stub**: "Claude drives a headless browser from `ubongo` against a deployed
service: loads the rendered UI, creates test users, exercises features, and hands the
operator a manual test script." Nothing below Level 4 actually exercises a service's
**application UI** — Molecule tests the role in a container, Level 2 confirms the stack
converges, Level 3 confirms public endpoints respond. None answer "does PhotoPrism
actually let me log in, upload a photo, and see a thumbnail?" (TODO 8.2).
The operator's original ask: *"Claude could spin up a browser and actually see the
generated service web-UIs to verify various things. Perhaps even generate test users
and test features and instruct me on tests as well."* That is TODO 2.2 (headless
browsing) + TODO 2.3 (test-user generation + manual-test instruction).
Today Claude "sees" a browser only **passively** — the `/screenshot` skill fetches
screenshots the operator took on `mamba`. This harness is the **active** counterpart:
Claude drives the browser itself.
## Decisions (the settled forks)
1. **Nature — Claude-driven exploratory.** Claude navigates the live UI with judgment
(look, click, reason about whether it works, notice anything off), not deterministic
scripts. This is the distinctive value; a scripted Playwright regression suite is
explicitly *not* built here.
2. **Mode — interactive, Claude-in-the-loop.** Follows from #1: exploratory judgment
can't be a headless cron gate. Scheduled smoke-testing stays out of scope (that is a
determinism job for health checks / Uptime Kuma later).
3. **Environment — staging, full exercise.** Claude creates test users and exercises
features (including destructive flows) against a *staging* deploy. Staging is a
rebuildable sandbox, so this resolves safety: no production-data risk, no prod
pollution.
4. **Auth — test users in Authentik (central IdP), real SSO flow.** Claude's browser
authenticates through Traefik + Authentik exactly as a real user would, faithfully
testing the real access path.
5. **Structure — per-service `VERIFY.md` backbone + free exploration.** Each service
role ships an acceptance spec of critical user journeys; Claude executes it *and*
explores beyond it. Repeatable + intent-capturing, without losing exploratory value.
## Scope
In scope: the **browser/UI** verification harness (TODO 2.2 browser portion) + the
**test-user** and **manual-test-instruction** standards (TODO 2.3) = ADR-008 **Level 4**.
Out of scope (siblings, noted not built): the other TODO-2.2 "live testing" methods —
API calls, `curl` pulls, log review. They share the spirit but are not browser work.
Also out: a scripted/CI regression suite; scheduled headless smoke checks.
---
## Architecture, mechanism, and workflow placement
**Mechanism.** Claude drives a real Chromium on `ubongo` via the **`playwright` Claude
Code plugin** (already earmarked in `claude-code-setup.md`, enabled when this lands).
No bespoke browser code — Claude calls the Playwright MCP tools (navigate, click, type,
screenshot, read DOM) and reasons over what it sees. Active counterpart to the passive
`/screenshot`-from-`mamba` pattern.
**Orchestration.** A boma skill/command — **`/verify-service <name>`** — run
interactively on `ubongo`. It:
1. Reads the service's `roles/<name>/VERIFY.md` acceptance spec.
2. Provisions/uses a test user in the **staging** Authentik.
3. Drives the browser through the real SSO flow into the staging service.
4. Executes the listed journeys exploratorily (judging pass/fail, screenshotting key
states) and free-explores.
5. Writes a dated verification report with linked screenshots.
6. Emits a manual-test checklist for anything it couldn't do.
**Pipeline placement.** Level 4 runs after Level 2 (staging deploy) and before
production promotion:
`build role → molecule (L1) → staging deploy (L2) → /verify-service (L4) → promote`.
It reaches the staging service over the LAN from `ubongo` (services on `srv`; resolved
via boma DNS), through Traefik + Authentik as a real user would.
**Boundaries (one unit, clear interface):** the skill *orchestrates*; `VERIFY.md`
*declares intent* (per service); Authentik *provides identity*; the report *captures
results*. Each is independently understandable and swappable.
---
## The `VERIFY.md` standard
Every service role ships a populated `roles/<service>/VERIFY.md`, copied from a new
template `docs/testing/service-verify-template.md` — parallel to how each role ships
`SECURITY.md` from `service-security-template.md`. It becomes a **role convention**
(every *service* role must have a populated `VERIFY.md`).
Contents:
- **Critical user journeys** — the acceptance criteria that define "working" for this
service (e.g. PhotoPrism: *SSO login → library loads → upload a test image →
thumbnail generates → search finds it*).
- **What good looks like** — states/screenshots to confirm.
- **Not browser-verifiable** — items to route to the manual-test handoff (hardware,
paid/external flows, subjective quality).
`/verify-service` reads `roles/<name>/VERIFY.md`, executes those journeys, and explores
beyond them.
## Test-user generation standard (TODO 2.3)
Test identities are provisioned in the **staging** Authentik (never the production IdP
— test accounts must not exist in prod):
- **Convention:** a dedicated `test` group / naming prefix (e.g. `test-<service>@…`) so
accounts are identifiable and bulk-removable.
- **Credentials:** ephemeral, generated per run (staging is rebuildable); held only for
the run. No test creds in `vault.yml`.
- **Idempotent:** reuse-or-create.
- **Teardown:** primary teardown is the staging rebuild (sandbox); the skill also
offers explicit cleanup of the `test` group.
## Reporting & manual-test handoff
- **Report:** `/verify-service` writes `docs/testing/reviews/YYYY-MM-DD-<service>.md`
(plus `latest.md`), mirroring `/review-repo` → `docs/reviews/` and
`/capacity-review` → `docs/hardware/reviews/`. It contains pass/fail per `VERIFY.md`
journey, observations, the test-user/env used, a verdict, and the manual-test
checklist. The committed markdown is the durable artifact.
- **Screenshots:** saved to a **git-ignored** dir on `ubongo` (PNGs would bloat the
repo); the report links them and inlines only a few key evidence shots.
- **Manual-test handoff (TODO 2.3):** anything Claude can't do — physical device,
paid/external flow, subjective judgment — becomes a **structured checklist** in the
report (numbered steps, expected result, why handed off). The operator runs them and
reports back. This is the "instruct me on tests" half of the vision, as a first-class
output.
## Safety
Even though staging is a sandbox:
- **Staging-only guard.** The skill refuses to run against production (verifies it is
pointed at the staging environment/inventory before acting) — an ADR-002-aligned hard
stop, since exploratory clicking is destructive by nature.
- **Confined blast radius.** Test users live only in the staging `test` group; the run
sticks to the target service.
- **No secrets leaked.** Screenshots can capture on-screen tokens/credentials, so the
git-ignored screenshot dir is also the safety boundary (evidence isn't committed by
default), and the skill avoids capturing credential screens.
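
The staging-only guard could be sketched roughly as below. This is a minimal illustration, assuming the skill is handed an inventory path shaped like `inventories/<env>/...`; the function name `ensure_staging` and the path-based check are hypothetical and not part of the design.

```python
from pathlib import PurePosixPath


def ensure_staging(inventory_path: str) -> str:
    """Return the environment name, or raise if it is not staging.

    Assumption: inventories live under inventories/<env>/ (hypothetical check).
    """
    parts = PurePosixPath(inventory_path).parts
    if "inventories" not in parts:
        raise ValueError(f"not an inventory path: {inventory_path!r}")
    idx = parts.index("inventories")
    if idx + 1 >= len(parts):
        raise ValueError(f"no environment in path: {inventory_path!r}")
    env = parts[idx + 1]
    if env != "staging":
        # Hard stop: exploratory clicking is destructive by nature.
        raise PermissionError(f"refusing to run against {env!r}; staging only")
    return env
```

Raising rather than returning a flag keeps the guard a hard stop: a caller cannot forget to check the result.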
---
## Documentation & implementation changes
This is a substantial capability → its own ADR-017, with reconciliations:
| Doc / artifact | Change |
|---|---|
| ADR-017 (new) | Home of record: harness, the five settled forks, `VERIFY.md` standard, test-user + manual-handoff standards, safety. |
| ADR-008 (testing) | Expand the Level 4 stub into the full definition; link ADR-017. |
| `docs/testing/service-verify-template.md` (new) | The `VERIFY.md` template (parallels `service-security-template.md`). |
| `.claude/commands/verify-service.md` (new) | The `/verify-service <name>` orchestrating skill. |
| `CLAUDE.md` | Role conventions: every *service* role must ship a populated `VERIFY.md`. Further reading: ADR-017. |
| `docs/security/service-checklist.md` | Add "passed Level 4 (`/verify-service`)" to the pre-production service-clearance gate. |
| `.gitignore` + `docs/testing/reviews/` | Ignore the screenshot dir; create the reviews dir (README/`.gitkeep`). |
| `STATUS.md` | Row: Level 4 verification — skill + template authorable; *running* deferred. |
| `docs/TODO.md` | Mark 2.2 (browser portion) + 2.3 addressed by ADR-017; note API/`curl`/log siblings remain. |
| `make new-role` scaffold | Scaffold `VERIFY.md` into new service roles (when that scaffold is next touched). |
**Buildable now** (no `ubongo`/Authentik/staging needed): ADR-017, the ADR-008
expansion, the `VERIFY.md` template, the `/verify-service` skill logic, the convention +
checklist + Further-reading edits, `.gitignore`/dir, STATUS/TODO. This spec yields real
working artifacts immediately — the skill and standards exist and are reviewable; only
the *live run* waits on the stack.
**Deferred** (needs the stack): actually running it (`ubongo` + `playwright` plugin +
Authentik + a staging deploy); the Authentik test-user provisioning automation;
per-service `VERIFY.md` files (need the service roles, which don't exist yet).
---
## Dependencies
- `ubongo` (ADR-015) — the host that runs the browser. Designed, not built.
- `playwright` Claude Code plugin — enabled when this lands (`claude-code-setup.md`).
- Authentik (CAPABILITIES §2, planned) — central IdP for test users + SSO.
- A staging environment with the service deployed (ADR-008 Level 2) — staging is
currently empty stubs.
---
## What was ruled out
| Option | Reason |
|---|---|
| Scripted Playwright regression suite | The operator wants exploratory judgment, not deterministic scripts; scripts add authoring/maintenance burden. A scripted layer could come later but is not this. |
| Scheduled headless smoke gate (cron) | Needs determinism, which the exploratory nature excludes; that role belongs to health checks / Uptime Kuma. |
| Verify against production | Exploratory clicking + test-user creation is destructive/polluting; staging sandbox instead. Production gets non-destructive checks elsewhere, not here. |
| Free-form exploration with no per-service spec | Flexible but non-repeatable and can miss a service's critical flow; `VERIFY.md` gives a backbone while keeping free exploration. |
| Staging bypasses SSO / per-app local users | Wouldn't exercise the real Traefik+Authentik access path; central test users in Authentik are faithful. |
| Commit screenshots to the repo | Repo bloat + secret-leak risk; git-ignored on `ubongo`, markdown report committed. |
See also: ADR-008 (testing — expanded), ADR-015 (control host — runs the browser),
ADR-002 (security), ADR-004 (one service = one role — `VERIFY.md` parallels
`SECURITY.md`), ADR-013/014 (heritage / knowledge sourcing).

@ -0,0 +1,164 @@
# Design — Firewall strategy (two-layer model + shared catalog)
- **Date:** 2026-06-06
- **Status:** Approved design — pending implementation plan
- **Resolves:** TODO 3.5 ("Decide the firewall strategy — which firewall, ruleset,
per-host vs central")
- **Becomes:** ADR-020 (this design is the basis for that ADR)
- **Scope note:** This is the **strategy** ADR. It pins the architecture and
responsibilities; the detailed builds (host nftables in `base`, OPNsense-as-code) are
separate follow-up specs (see *Scope*).
---
## Problem
boma needs a firewall strategy that is **predictable, declarative, and defends the
stated threat model** (opportunistic external, lateral movement / blast radius,
operator/agent error — ADR-002). The ADRs already commit to pieces of this — `nftables`
default-deny on hosts (ADR-002), OPNsense at the perimeter (ADR-007), Docker with
`iptables: false` (ADR-004) — but no document ties them together: *which layer owns
what, where firewall intent is declared, and how the two layers stay consistent.*
Without that, ports drift open ad-hoc and "per-host vs central" stays unanswered.
The roles that would hold the host firewall (`base`, `docker_host`) are empty, and there
is no OPNsense automation yet — so this is greenfield strategy work.
## The two-layer model
Two firewall layers, each with a distinct job; the host layer adds deliberate
defense-in-depth for the one thing the perimeter structurally cannot see.
### OPNsense — perimeter + inter-VLAN
Owns everything *between zones* and at the edge:
- WAN edge (the internet boundary).
- Inter-VLAN policy: `lan`/`iot`/`guest` → `srv`, `mgmt` access, the documented
per-VLAN egress rules (ADR-007).
- **Structurally blind to intra-`srv` traffic**: services share the `srv` subnet
(VLAN 20), which is switched and never reaches the OPNsense gateway.
### Host nftables — host-local + east-west within `srv` (in `base`)
Runs on every Debian VM:
- **Default-deny inbound**; allow loopback + established/related.
- **East-west allowlist**: a service host accepts a connection only from declared
sources (e.g. the reverse proxy, a named peer). This is the lateral-movement control
OPNsense cannot provide — the blast-radius goal in ADR-002.
- **Permissive egress**: allow outbound + established/related. Per-VLAN egress
restriction stays at OPNsense (where it already lives, ADR-007). Rationale: host-level
egress allowlisting is high-friction (every DNS/NTP/update/registry/webhook call must
be enumerated) for limited additional benefit given OPNsense already bounds where each
VLAN can go.
- **Docker integration**: Docker daemon runs with `"iptables": false`; nftables owns all
filtering, including container traffic (ADR-004).
- **Guaranteed management plane**: loopback, established/related, and `wt0` (the NetBird
overlay, ADR-016) for SSH + Ansible are *always* allowed, independent of the catalog,
and the ruleset is applied atomically — so a malformed or empty catalog can never lock
out management. (ADR-016: SSH is allowed only on `wt0`, not the LAN.)
## The shared service catalog (single source of truth)
A central, declarative **service catalog** in `group_vars/` is the one source of truth
for firewall intent. This aligns with ADR-002's existing rule that "port definitions
live in `group_vars/` so rules stay in sync with deployed services," and keeps
connectivity *topology* (inherently cross-cutting) in inventory rather than in any one
self-contained service role (ADR-004).
Each entry describes a service's **ingress** as a list of allow rules:
```yaml
photoprism:
  ingress:
    - { from: reverse_proxy, port: 2342, proto: tcp }
reverse_proxy:
  ingress:
    - { from: lan, port: 443, proto: tcp }
```
`from` is **symbolic**, resolved at render time:
- a **host or group** → IP(s) from inventory;
- a **role** (e.g. `reverse_proxy`) → the host(s) filling it;
- a **VLAN/zone** (e.g. `lan`) → the subnet from the ADR-007 table.
Symbolic sources keep the catalog readable and resilient to IP changes.
### Each layer renders only its own slice
The same catalog feeds both layers; each filters for the rules it owns:
| Ingress rule | Host nftables | OPNsense |
|---|---|---|
| `from: reverse_proxy` (a `srv` peer) | allow proxy IP → port | — (intra-`srv`, invisible) |
| `from: lan` (cross-VLAN) | allow `lan` subnet → port | allow `lan` → host:port |
The dominant pattern falls out naturally: most services are **proxied** — their only
ingress is `from: reverse_proxy`; users reach them *through* the reverse proxy, which
alone carries `from: lan, port: 443`. This matches "services sit behind the reverse
proxy with authentication" (ADR-002).
"Shared catalog, each layer renders its own" was chosen over a single
connectivity-model-generates-both (too much machinery, tight coupling of two very
different rule domains) and over fully independent per-layer declarations (real drift
risk: a port opened on the host but not at OPNsense, or vice versa).
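
The per-layer filtering in the table above can be sketched as a small classifier. This is an illustrative sketch only: the function name `layers_for` and the `host_zone` parameter are assumptions, and it covers just the two cases the table lists (an intra-`srv` peer versus a cross-VLAN zone).

```python
def layers_for(rule_from, zone_names, host_zone="srv"):
    """Return which layers render an ingress rule for a host in host_zone.

    rule_from  -- the symbolic `from` value of the ingress rule
    zone_names -- the set of known VLAN/zone names
    """
    if rule_from in zone_names and rule_from != host_zone:
        # Cross-VLAN traffic passes the gateway, so both layers must allow it.
        return ("host", "opnsense")
    # A peer inside the same subnet is switched and never reaches OPNsense.
    return ("host",)
```

With `zone_names = {"lan", "srv", "mgmt", "iot", "guest"}`, `from: reverse_proxy` lands in the host slice only, while `from: lan` lands in both.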
## OPNsense automation — owned here, mechanism deferred
OPNsense is **Ansible-managed** (CLAUDE.md: "OPNsense is entirely Ansible; do not reach
for a Terraform OPNsense provider"). It renders the **cross-VLAN slice** of the catalog
(every `from: <other-zone>` rule) plus the static ADR-007 facts (WAN edge, per-VLAN
egress, mgmt access, inter-VLAN defaults).
This ADR pins **what** OPNsense owns and that it renders from the shared catalog. The
**how** — config-XML templating vs the OPNsense API vs a plugin — is a substantial,
separate tooling decision, **deferred to the OPNsense-as-code follow-up spec**. Recorded
here as an explicit open sub-decision so it is not lost.
## Guardrails & enforcement
- **The catalog is authoritative.** If a port is not in the catalog, it does not exist.
This hardens the existing CLAUDE.md guardrail ("never open a firewall port ad-hoc on a
host") into a positive contract.
- **The `firewall` tag** (ADR-019) marks firewall tasks, so `--tags firewall` re-renders
rules on `base` and any service role that contributes them.
- **Drift detection (aspiration).** A deterministic check — in the spirit of
`scripts/check-tags.py` — compares each host's actual listening ports / live `nft`
ruleset against the catalog and flags anything undeclared. Ties to TODO 8.5
(`/security-review`) and the "undeclared open ports" pre-scan idea. Listed as a
consequence and future guardrail; not necessarily built in the first implementation.
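
The core of such a drift check is a set difference. The sketch below assumes the listening ports have already been gathered as `(proto, port)` pairs and that SSH on port 22 belongs to the always-allowed management plane; the function name and both inputs are hypothetical.

```python
def undeclared_ports(listening, resolved_rules, always_allowed=(("tcp", 22),)):
    """Return listening (proto, port) pairs that the catalog does not declare.

    listening      -- iterable of (proto, port) pairs observed on the host
    resolved_rules -- resolved catalog rules, each with "proto" and "port"
    always_allowed -- management-plane ports exempt from the catalog (assumed)
    """
    declared = {(r["proto"], int(r["port"])) for r in resolved_rules}
    declared |= set(always_allowed)
    # Sorted output keeps reports stable from run to run.
    return sorted(set(listening) - declared)
```

A non-empty result is what the check would flag.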
## Consequences
- "Per-host vs central" is answered: **both**, with clear ownership — central perimeter
(OPNsense) + per-host default-deny with east-west allowlisting, fed by one catalog.
- Lateral movement within `srv` is constrained (the gap OPNsense can't close).
- One declarative catalog means no ad-hoc ports and no cross-layer drift on the shared
facts (ports, IPs, sources).
- Cost: the catalog and the render-per-layer machinery must be built and maintained;
east-west allowlisting adds per-service ingress declarations (mitigated by the
proxied-by-default pattern, which keeps most entries to a single line).
## Scope
**This ADR decides:** the two-layer model and each layer's responsibilities; host
nftables = default-deny inbound + east-west allowlist + permissive egress + guaranteed
management plane + Docker `iptables:false`; the shared `group_vars` service catalog as
single source of truth with symbolic sources; each layer renders its own slice; the
no-ad-hoc-ports guardrail.
**Deferred to follow-up specs (each its own brainstorm → plan):**
1. **Host nftables implementation** in `base` — exact catalog schema, nftables template
structure, Docker `iptables:false` integration, fail-safe ordering, Molecule tests.
The natural next spec.
2. **OPNsense-as-code** — the tooling mechanism + cross-VLAN rule rendering.
3. **Drift-detection check** — if/when we build it.
## Related
ADR-002 (security baseline: nftables default-deny, fail2ban, blast radius),
ADR-004 (Docker model: `iptables:false`), ADR-007 (network topology, VLANs, OPNsense,
per-VLAN egress), ADR-016 (NetBird mesh: SSH on `wt0` only), ADR-019 (`firewall` tag).

@ -0,0 +1,219 @@
# Design — Host nftables firewall (the `firewall` concern of `base`)
- **Date:** 2026-06-06
- **Status:** Approved design — pending implementation plan
- **Implements:** ADR-020 deferred build #1 (host nftables in `base`)
- **Scope:** The **`firewall`-tagged concern of the `base` role only**. Other `base`
concerns (SSH hardening, fail2ban, auditd, packages, users) are separate future efforts.
Docker netfilter is deferred to the `docker_host` role.
---
## Problem
ADR-020 settled the firewall *strategy*: a per-host nftables layer doing default-deny
inbound + east-west allowlisting + permissive egress, rendered from a shared
`group_vars` service catalog. Nothing is built yet — `roles/base/` is empty. This spec
designs the concrete host firewall: the catalog schema, how rules are resolved and
rendered, how they are applied without locking out the host, and how it is tested.
Two hard constraints shape the design:
1. **Molecule runs in a privileged Docker container sharing the dev host (`ubongo`)
kernel netfilter** — applying real nftables rules there could mutate the live host.
So Level-1 testing renders and syntax-checks but does **not** apply.
2. **Lockout risk** — a bad ruleset can brick SSH/Ansible. On-cluster hosts have the
Proxmox console as break-glass; offsite `askari` (Hetzner) has no comparably cheap equivalent.
## Scope decisions (settled in brainstorming)
- **Host firewall only**, coherent on any host (even one with no services). Docker
`iptables:false` + container forward/NAT/masquerade are **deferred to `docker_host`**,
which contributes rules via an extension hook (below).
- **Placement lives in the catalog** (`host:` | `group:` | `hosts:`), giving one source
of truth that also resolves symbolic sources. Proxmox HA/migration moves a *VM*
between physical nodes but the VM keeps its static `srv` IP and inventory identity, so
node-level failover is invisible to the firewall. A planned service relocation is a
one-line catalog edit + `--tags firewall` re-deploy (which re-renders opened ports
*and* every source resolution consistently). Within-group HA is handled by placing a
service on a `group`/`hosts` list — the allowlist then already covers every member.
- **Level-1 testing = render + `nft -c` syntax check, no apply.** Enforcement is
verified at Level 2 on staging VMs.
- **Auto-rollback safety net** on apply (critical for offsite `askari`).
## Role layout
Scaffold with `make new-role base`, then implement the firewall concern:
```
roles/base/
  tasks/main.yml                     # include_tasks firewall.yml (tags: [firewall]); grows later
  tasks/firewall.yml                 # install nftables, render, validate, safe-apply
  filter_plugins/firewall_rules.py   # pure catalog→resolved-rules resolver (pytest-unit-tested)
  templates/nftables.conf.j2
  defaults/main.yml                  # base__firewall_* behaviour knobs
  handlers/main.yml
  molecule/default/                  # fixture catalog + inventory; converge + verify
  README.md, meta/main.yml
```
`base` is infrastructure, not a *service* role, so the service-role `SECURITY.md` /
`VERIFY.md` conventions (ADR-004) do not apply. The `base` role import in a playbook
carries the `base` role-name tag (enforced by `check-tags.py`, ADR-019); the firewall
tasks within carry the `firewall` concern tag.
## Data model — shared catalog + zones
Two new **global inventory facts** (read by `base` now and OPNsense later, so plain
names, not role-namespaced) in `inventories/<env>/group_vars/all/firewall.yml`:
```yaml
# Zone → subnet (from ADR-007)
firewall_zones:
  lan: 10.30.0.0/24
  srv: 10.20.0.0/24
  mgmt: 10.10.0.0/24
  iot: 10.40.0.0/24
  guest: 10.50.0.0/24
# Service catalog: name → placement + ingress
firewall_catalog:
  reverse_proxy:
    host: docker01 # placement: host | group | hosts:[...]
    ingress:
      - { from: lan, port: 443, proto: tcp }
  photoprism:
    host: docker01
    ingress:
      - { from: reverse_proxy, port: 2342, proto: tcp }
```
- **Placement** is exactly one of `host: <name>`, `group: <group>`, or `hosts: [<name>, …]`.
- **`from`** resolves three ways, checked in this order: (1) a key in `firewall_zones`
→ that subnet; (2) a key in `firewall_catalog` → that service's placement → host
IP(s) as `/32`; (3) an inventory group or host name → its IP(s) as `/32`. An
unresolvable `from` is a hard error (fail fast, never silently open/skip).
Role **behaviour knobs** stay role-namespaced in `roles/base/defaults/main.yml`:
| Default | Value | Purpose |
|---|---|---|
| `base__firewall_mgmt_interface` | `wt0` | interface SSH is accepted on (NetBird overlay, ADR-016) |
| `base__firewall_ssh_port` | `22` | SSH port allowed on the mgmt interface |
| `base__firewall_rollback_timeout` | `45` | seconds before auto-revert fires |
| `base__firewall_dropin_dir` | `/etc/nftables.d` | extension dir included by the ruleset |
## Resolution & rendering
The resolver is a **pure Python filter plugin**, `roles/base/filter_plugins/firewall_rules.py`,
exposing `resolve_firewall_rules(catalog, zones, inventory_hostname, hostvars)`. It:
1. selects catalog entries placed on `inventory_hostname` (matching `host`, membership
in `group`, or presence in `hosts`);
2. for each entry's `ingress` rules, resolves `from` to a list of source CIDRs (zone /
service-placement / group-or-host, per the order above);
3. returns a **deterministic, de-duplicated, sorted** list of
`{proto, port, sources: [cidr, …]}`.
Chosen over inline Jinja (unreadable, untestable) and a `set_fact` loop (awkward to
unit-test) — a filter plugin matches the house style of `check-tags.py` /
`capacity-scan.py` and is pytest-unit-testable in isolation. Host→IP resolution reads
`hostvars[<host>].ansible_host` (the static `srv` IP the Terraform-generated inventory
provides).
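
A minimal sketch of what such a resolver might look like follows. It keeps the signature named above, but everything else is an assumption for illustration: the helper names, the use of each host's `group_names` to derive group membership from `hostvars`, and checking host names before groups within resolution step (3).

```python
def _members(group, hostvars):
    """Hosts whose group_names include `group`, sorted for determinism."""
    return sorted(h for h, v in hostvars.items() if group in v.get("group_names", []))


def _placement_hosts(entry, hostvars):
    """Expand a catalog entry's placement (host | group | hosts) to host names."""
    if "host" in entry:
        return [entry["host"]]
    if "hosts" in entry:
        return list(entry["hosts"])
    if "group" in entry:
        return _members(entry["group"], hostvars)
    raise ValueError("catalog entry has no placement (host | group | hosts)")


def _cidrs(hosts, hostvars):
    """Map host names to /32 CIDRs using each host's ansible_host."""
    return [f"{hostvars[h]['ansible_host']}/32" for h in hosts]


def _resolve_from(source, catalog, zones, hostvars):
    """Resolve a symbolic source: zone, then catalog service, then host/group."""
    if source in zones:
        return [zones[source]]
    if source in catalog:
        return _cidrs(_placement_hosts(catalog[source], hostvars), hostvars)
    if source in hostvars:
        return _cidrs([source], hostvars)
    members = _members(source, hostvars)
    if members:
        return _cidrs(members, hostvars)
    # Fail fast: never silently open or skip a rule.
    raise ValueError(f"unresolvable firewall source: {source!r}")


def resolve_firewall_rules(catalog, zones, inventory_hostname, hostvars):
    """Return sorted, de-duplicated {proto, port, sources} rules for one host."""
    merged = {}
    for entry in catalog.values():
        if inventory_hostname not in _placement_hosts(entry, hostvars):
            continue
        for rule in entry.get("ingress", []):
            key = (rule["proto"], int(rule["port"]))
            merged.setdefault(key, set()).update(
                _resolve_from(rule["from"], catalog, zones, hostvars)
            )
    return [
        {"proto": proto, "port": port, "sources": sorted(sources)}
        for (proto, port), sources in sorted(merged.items())
    ]
```

In a real filter plugin this function would be exposed through a `FilterModule.filters()` mapping; that wiring is left out so the logic stays importable by pytest on its own.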
`tasks/firewall.yml` builds `base__firewall_resolved` from the filter; the template
renders that flat list:
```jinja
#!/usr/sbin/nft -f
flush ruleset
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    iif "lo" accept
    ct state established,related accept
    ct state invalid drop
    # iifname matches by name, so the ruleset still loads (and passes nft -c) where the interface is absent
    iifname "{{ base__firewall_mgmt_interface }}" tcp dport {{ base__firewall_ssh_port }} accept
    ip protocol icmp accept
    ip6 nexthdr ipv6-icmp accept
{% for r in base__firewall_resolved %}
    ip saddr { {{ r.sources | join(', ') }} } {{ r.proto }} dport {{ r.port }} accept
{% endfor %}
  }
  chain forward { type filter hook forward priority 0; policy drop; }
  chain output { type filter hook output priority 0; policy accept; }
}
include "{{ base__firewall_dropin_dir }}/*.nft"
```
A host with no catalog entries still gets a valid default-deny + management-plane
ruleset. The `include` is the `docker_host` extension hook (forward/NAT drop-ins).
Sorted resolved rules → stable diffs and deterministic tests.
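
Because the template emits one accept line per resolved rule, a verify step can predict those lines and look for them in the rendered file. The sketch below mirrors the template's accept-line format; the function names are hypothetical, and the exact spacing is an assumption that must be kept in step with the real template.

```python
def render_accept_line(rule):
    """Render one resolved rule the way the template's loop body would."""
    sources = ", ".join(rule["sources"])
    return f"ip saddr {{ {sources} }} {rule['proto']} dport {rule['port']} accept"


def missing_accept_lines(rendered_conf, resolved_rules):
    """Return expected accept lines that are absent from the rendered config."""
    present = {line.strip() for line in rendered_conf.splitlines()}
    return [
        line
        for line in (render_accept_line(r) for r in resolved_rules)
        if line not in present
    ]
```

An empty result from `missing_accept_lines` corresponds to the "expected accept lines are present" assertion; it says nothing about syntax, which remains the job of `nft -c`.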
## Safe apply (lockout protection)
`tasks/firewall.yml` renders `/etc/nftables.conf`; when it changes, a **linear**
safe-apply sequence runs (deliberately in tasks, not a handler, so the confirm/cancel
step is controllable — a small, justified deviation from the handler idiom, noted in the
role README):
1. **Validate** – `nft -c -f /etc/nftables.conf`; fail the play if invalid, before
touching the live ruleset.
2. **Snapshot** – `nft list ruleset > /etc/nftables.rollback` (empty/flush on first run).
3. **Arm revert** — `systemd-run --on-active={{ base__firewall_rollback_timeout }}
--unit=nft-rollback nft -f /etc/nftables.rollback` (transient timer, no `at`
dependency).
4. **Apply** – `nft -f /etc/nftables.conf`.
5. **Confirm + disarm** — the next Ansible task running proves the connection survived →
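`systemctl stop nft-rollback.timer` (the transient timer unit that `systemd-run` creates; stopping the bare service name would leave the timer armed). If the apply bricked connectivity, the play cannot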
continue, the timer fires, and the host self-heals (the offsite-`askari` safeguard).
6. **Persist** — enable `nftables.service` so `/etc/nftables.conf` loads on boot.
`established/related` (rendered in the ruleset) means the in-flight Ansible session
survives the swap; atomic `nft -f` avoids partial states.
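
The ordering of the safe-apply sequence can be sketched in Python with the command runner injected, so the logic is testable without touching a real ruleset. This is an illustration of the sequence only, not the Ansible tasks themselves: `safe_apply`, `run`, and `confirm` are hypothetical names, and the sketch assumes the transient timer unit is called `nft-rollback.timer`.

```python
from typing import Callable, List


def safe_apply(
    run: Callable[[List[str]], None],
    confirm: Callable[[], bool],
    timeout: int = 45,
    conf: str = "/etc/nftables.conf",
    rollback: str = "/etc/nftables.rollback",
) -> bool:
    """Run the six safe-apply steps; return True once the apply is confirmed.

    run     -- executes one command and raises on failure (injected)
    confirm -- returns True if the management connection survived (injected)
    """
    run(["nft", "-c", "-f", conf])                       # 1. validate first
    run(["sh", "-c", f"nft list ruleset > {rollback}"])  # 2. snapshot
    run(["systemd-run", f"--on-active={timeout}",        # 3. arm the revert
         "--unit=nft-rollback", "nft", "-f", rollback])
    run(["nft", "-f", conf])                             # 4. apply
    if not confirm():
        # Leave the timer armed so the host reverts on its own.
        return False
    run(["systemctl", "stop", "nft-rollback.timer"])     # 5. disarm (assumed unit name)
    run(["systemctl", "enable", "nftables.service"])     # 6. persist across boots
    return True
```

Injecting `run` lets the ordering, and the fact that a failed confirm never disarms the timer, be asserted in a unit test; the real rollback path still needs the Level 2 run.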
**NetBird dependency:** locking SSH to `wt0`-only assumes NetBird (ADR-016) is built.
Until then, `base__firewall_mgmt_interface` (and, if needed, an additional management
source) is set to a reachable path so the role is deployable independently. This is a
config knob, not a code dependency.
## Testing (ADR-008)
- **Level 1 / pytest** — unit-test `firewall_rules.py` against fixture catalogs: zone
resolution, service→host-IP resolution, `group`/`hosts` multi-host placement, a host
with no services, source de-dup/sort, and an unresolvable `from` raising. Mirrors
`tests/test_check_tags.py` (import the module, assert on return values).
- **Level 1 / Molecule** — fixture `firewall_catalog` + fixture inventory (host_vars/
group_vars) in the scenario; `converge` renders `/etc/nftables.conf`; `verify` asserts
(a) expected accept lines are present for the fixture and (b) `nft -c -f
/etc/nftables.conf` validates syntax. **No apply** (kernel safety).
- **Level 2 / staging** — real apply on staging VMs verifies enforcement *and* the
safe-apply + auto-rollback path (steps 2–5), which Level 1 cannot safely cover.
The Molecule base image is not guaranteed to ship `nft`. The role installs the
`nftables` package as its first firewall task, so by the time `verify` runs the `nft -c`
syntax check, `nft` is present (installed during `converge`).
## Open dependencies / notes
- **NetBird/ADR-016 unbuilt** — see the mgmt-interface knob above; full `wt0`-only
lockdown lands when NetBird does.
- The safe-apply orchestration (steps 2–5) has **no Level-1 coverage** by design; it is
integration-tested at Level 2. Called out so the gap is explicit.
## Scope summary
**Built here:** `firewall_catalog`/`firewall_zones` schema; `firewall_rules.py` resolver
+ pytest; `nftables.conf.j2` (default-deny input, mgmt plane, permissive egress, drop-in
`include` hook); safe-apply-with-rollback tasks; Molecule render/syntax scenario;
`base` role scaffolding (README, meta, defaults, handlers).
**Deferred:** Docker `iptables:false` + container forward/NAT (→ `docker_host` spec, via
the drop-in hook); OPNsense rendering from the same catalog (→ OPNsense-as-code spec);
drift-detection check (ADR-020); all other `base` concerns.
## Related
ADR-020 (firewall strategy), ADR-002 (security baseline), ADR-004 (Docker model —
`iptables:false`, one service = one role), ADR-007 (VLANs/subnets), ADR-008 (testing
levels), ADR-016 (NetBird mesh — SSH on `wt0`), ADR-019 (`firewall` tag).

@ -0,0 +1,188 @@
# Design — Ansible tagging standard (targeted, predictable runs)
- **Date:** 2026-06-06
- **Status:** Approved design — pending implementation plan
- **Resolves:** TODO 3.7 ("Define a tagging standard that lets us target runs without
over-tagging") and TODO 3.11 ("Deliberate tagging strategy") — the same thread
- **Becomes:** ADR-019 (this design is the basis for that ADR)
---
## Problem
boma wants to run playbooks **targeted** — a single service, a single layer, or a
single cross-cutting concern — and to do so **transparently and predictably**: you
should be able to look at a `--tags` invocation and know exactly what it will and won't
touch. CLAUDE.md already mandates that every task be tag-filterable, but no *vocabulary*
or *naming convention* exists. Without one, tags proliferate ad-hoc per role and the
"predictable" property is lost — and the TODO explicitly warns against the opposite
failure mode, **over-tagging**.
The repo is effectively greenfield for this: `base` and `docker_host` are empty, and the
only tags in existence are `[base]`/`[docker]` in `site.yml` and `[bootstrap]` in
`bootstrap.yml`. So we can bake the standard into role-authoring conventions *before*
there are a dozen service roles to retrofit.
## Targeting axes (what we want to slice by)
1. **Layer / role** – `--tags base`, `--tags docker`
2. **Single service** – `--tags photoprism`, `--tags traefik`
3. **Concern / function** – `--tags firewall`, `--tags logging`, …
Lifecycle phases (bootstrap/config/deploy) are **not** a tag axis — `bootstrap.yml` vs
`site.yml` already separate those as whole playbooks.
Key simplification: because of ADR-004 (*one service = one role*, role name = service
name), axes 1 and 2 are the **same mechanism** — a tag equal to the role name. Only the
concern axis needs a curated vocabulary.
## Approach (chosen): two-tier tagging
**Tier 1 — role/service tag (mechanical).** The tag *equals the role name*, applied
**once** at the role-import level in the playbook:
```yaml
roles:
  - role: photoprism
    tags: [photoprism]
```
Ansible propagates the tag to every task in the role. This covers both the layer/role
and single-service axes with one rule and **zero per-task burden**.
**Tier 2 — concern tag (curated).** A small **closed, documented list** of cross-cutting
concern tags, applied per-task/block **only where a task genuinely belongs to that
concern**. `--tags firewall` then hits firewall tasks in `base` and in every service
role.
Rejected alternatives: *concern-only/flat* (loses natural `--tags <service>` ergonomics);
*rich multi-dimensional* (role+service+concern+lifecycle+ad-hoc per task) — that is
precisely the over-tagging the TODO warns against.
## The closed concern list
Litmus test for earning a spot: a concern must (a) appear in **2+ roles**, (b) be
something you'd realistically want to run as a slice on its own, and (c) not overlap
confusingly with another.
**Baseline concerns** (mostly in `base`, some echoed in service roles):
| Tag | Covers |
|-----|--------|
| `packages` | apt package install/management |
| `users` | accounts, groups, sudo |
| `firewall` | nftables rulesets & port definitions (ADR-002) |
| `hardening` | security baseline — sshd config, fail2ban, auditd, sysctl |
| `logging` | Alloy / log-shipping config (ADR-018) |
| `monitoring` | metric exporters / health checks |
**Service concerns** (in every service role, ADR-004):
| Tag | Covers |
|-----|--------|
| `config` | render templated config/compose files to disk — **no restart** |
| `deploy` | bring services up / restart (`compose up -d`) |
| `proxy` | reverse-proxy + TLS registration (Traefik routes, Authentik) |
Nine tags total. The `config`/`deploy` split is deliberate and high-value: `--tags
config` re-renders and lets you diff configuration without bouncing services; `--tags
deploy` does the restart.
`backup` and `secrets` are **intentionally omitted** until the roles that need them
exist — they enter via the extend process, not speculative reservation.
## `always` / `never` policy
boma uses Ansible's two built-in special tags, narrowly:
- **`always`** — reserved strictly for **cheap preflight assertions** (vault unlocked,
OS is Debian 13, required vars present). Ensures even `--tags config` runs its safety
guards.
- **`never`** — reserved for **destructive/expensive opt-in tasks**, each paired with a
descriptive tag (e.g. `never, force_pull` or `never, restore`). They never run unless
explicitly named, keeping dangerous actions out of normal runs. The descriptive
partner tag is a documented `never`-paired opt-in (allowed by the linter).
## Predictability principle: tags are union-only
`--tags a,b` runs tasks tagged a **OR** b — Ansible has no native AND. Rather than fight
this, we make it an explicit principle: **boma targets one axis at a time** – *either* a
role/service (`--tags photoprism`) *or* a concern (`--tags firewall`), never an
intersection like "photoprism's firewall only." If that is ever genuinely needed, the
answer is "just run `--tags photoprism`" (idempotent and fast). Designing for
intersection is the over-tagging trap; we decline it on purpose.
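
The union behaviour can be modelled in a few lines. The sketch below is a simplified model of Ansible's tag selection for illustration, not its actual implementation: it ignores `--skip-tags` and the special `tagged`/`untagged` values, and `task_runs` is a hypothetical name.

```python
def task_runs(task_tags, requested=()):
    """Decide whether a task runs for a given --tags selection (simplified).

    task_tags -- all tags on the task, including any inherited role tag
    requested -- tags passed via --tags; empty means a normal, untargeted run
    """
    task_tags = set(task_tags)
    requested = set(requested) or {"all"}
    explicit = requested - {"all"}
    if "never" in task_tags:
        # Opt-in only: one of the task's tags must be named explicitly.
        return bool(task_tags & explicit)
    if "always" in task_tags or "all" in requested:
        return True
    # Union semantics: any single matching tag is enough.
    return bool(task_tags & requested)
```

Note that in this model a `never`-tagged task that also inherits a role tag matches `--tags <role>`, because any shared tag counts as an explicit request; that interaction is worth verifying against the Ansible version in use before relying on `never` inside role-tagged imports.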
## Reconciling the existing CLAUDE.md rule
CLAUDE.md currently says *"every task must have at least one tag."* Under the two-tier
model the role tag is applied **once at the play/import level** and **inherited** by
every task, so tasks are always reachable without hand-tagging each one. The rule is
**reworded** to:
> Import each role with its role-name tag (once, at the play level). Within a role, tag a
> task/block with a concern tag from the approved list **only where it genuinely belongs
> to that concern** — don't invent tags or tag for tagging's sake.
This directly resolves the "without over-tagging" tension.
## Terraform / Proxmox VM tags (metadata only)
Formalize the convention that already half-exists in `staging/main.tf`
(`tags = ["staging", each.value.group]`). Every TF-managed VM gets exactly three tags:
| Tag | Value | Purpose |
|-----|-------|---------|
| env | `staging` \| `production` | which environment |
| role/group | `docker_hosts`, `proxmox_hosts`, … | matches the inventory group |
| managed-by | `terraform` | distinguishes IaC VMs from hand-made ones |
Set as `tags = ["${env}", each.value.group, "managed-by=terraform"]` in the env
`main.tf` (env is constant per directory).
**Explicit non-goals** (stated so nobody wires them up later): these tags are **pure
metadata for transparency** — glanceable in the Proxmox UI. They do **not** drive
run-targeting and do **not** feed inventory. `scripts/tf_to_inventory.py` keeps building
groups from the `group` output field, which stays the single source of truth.
## Enforcement
A small **lint check wired into `make lint`**: a script collects every `tags:` value
across `roles/` and `playbooks/` and fails if any tag is not in the allowed set:
```
{role names} ∪ {9 concern tags} ∪ {always, never} ∪ {documented never-paired opt-ins}
```
The allowed concern list (and the `never`-paired opt-ins) live in **one
machine-readable file, `tests/tags.yml`**, which both the linter reads and the ADR
documents — so doc and enforcement cannot drift. This is more honest than ansible-lint's
limited built-in tags rule. A unit test (mirroring `tests/test_capacity_scan.py`) covers
the checker.
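
A minimal sketch of the core check, assuming the tag values have already been collected from `roles/` and `playbooks/` and the allowed lists loaded from `tests/tags.yml`. The function name and the sample inputs are hypothetical; the real checker's shape is not settled here.

```python
def disallowed_tags(used_tags, role_names, concern_tags, never_opt_ins):
    """Return every used tag that falls outside the allowed set."""
    allowed = (
        set(role_names)
        | set(concern_tags)
        | {"always", "never"}
        | set(never_opt_ins)
    )
    return sorted(set(used_tags) - allowed)


# Hypothetical inputs; "fw" stands in for an unapproved ad-hoc tag.
role_names = ["base", "photoprism"]
concern_tags = ["firewall"]
never_opt_ins = ["force_pull", "restore"]
used = ["photoprism", "firewall", "never", "restore", "fw"]

problems = disallowed_tags(used, role_names, concern_tags, never_opt_ins)
print(problems)
```

A non-empty result would make `make lint` fail, naming the offending tags.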
## The "propose to extend" process
To add a concern tag: (1) add it to `tests/tags.yml`; (2) add a row to the ADR-019 table
with a one-line justification showing it passes the litmus test (cross-cutting, 2+
roles, distinct). That is the whole gate — lightweight, but it leaves a paper trail.
## Deliverables
- **New `docs/decisions/019-tagging.md`** — the standard: rationale, two-tier model,
concern table, union-only principle, `always`/`never` policy, Proxmox tag convention,
extend process.
- **`tests/tags.yml`** — machine-readable allowed concern list + `never`-paired opt-ins.
- **Lint checker script** (e.g. `scripts/check-tags.py`) + **`make lint`** wiring +
**`tests/test_check_tags.py`**.
- **CLAUDE.md** — reword the tag bullet under *Ansible conventions*; add the Proxmox tag
convention under *Terraform conventions*; add ADR-019 to *Further reading*.
- **`terraform/environments/{staging,production}/main.tf`** — apply the three-tag
convention.
- **`docs/TODO.md`** — mark 3.7 and 3.11 DECIDED (ADR-019).
- **`docs/CAPABILITIES.md`** — note targeted runs as a capability, if it fits.
## Out of scope
- Intersection targeting (role ∩ concern) — declined on purpose (see principle).
- Lifecycle-phase tags — handled by separate playbooks.
- Proxmox tags feeding inventory or run-targeting — metadata only.
- `backup`/`secrets` concern tags — added later via the extend process.

@ -0,0 +1,214 @@
# Design — Operational access (ADR-021)
- **Date:** 2026-06-09
- **Status:** Approved design — pending implementation plan
- **Implements:** New ADR-021. Resolves TODO 3.2 (API / API access) and TODO 7.2
(what to set up on hosts, given direct access will be rare).
- **Amends:** ADR-016 (SSH was mesh-only; now also from `ubongo`'s LAN address) and
ADR-020 (adds an `ssh-from-control` symbolic catalog source).
- **Scope:** The operational-access *doctrine* + the declarative `access__*` data model,
the rendered `ACCESS.md` record, and the `/check-access` verifier design. It does **not**
build any of it — `base`/service roles and live hosts don't exist yet. Designed now,
built when there is something to access.
---
## Problem
boma is built security-first: nftables default-deny, SSH reachable only on the NetBird
`wt0` mesh interface (ADR-016), every service behind the reverse proxy + SSO, no ad-hoc
ports (ADR-002/020). That posture is correct — but it leaves an unanswered operational
question: **when a service or host breaks, how does the operator (and the AI working on
boma's behalf from `ubongo`) actually get in to troubleshoot it?**
Experience on similar projects shows troubleshooting is far more effective with *several*
documented ways in — SSH, container exec, logs, an admin API — so a single broken path
doesn't leave the operator blind. Today boma has no standard guaranteeing those paths exist, are
documented, or still work. The risk is the classic one: the access you assumed you had is
stale exactly when you need it (key rotated, API disabled, token expired).
boma already has the right *shape* for the fix. Service roles carry record docs —
`SECURITY.md` (security answers) and `VERIFY.md` (acceptance spec) — gated by the service
checklist and the `new-role` runbook. What's missing is the third sibling: an
**operational access record**, plus the doctrine behind it.
Two constraints shape the design:
1. **Minimal attack surface is non-negotiable.** "Multiple ways in" must mean multiple
paths over the *trusted* interface, never new exposed ports. Resolution: all routine
access runs over the mesh from `ubongo`.
2. **A documented path that is never tested drifts.** It fails exactly when needed. So
the structured access facts must be *data* that both renders the doc and drives an
active verifier — the two can then never disagree.
## Decisions settled in brainstorming
- **Access is a deployment deliverable.** The deploy that creates a host/service also
records and (by design) proves its access paths. Not rediscovered under pressure.
- **All routine access over the mesh** (`wt0`, from `ubongo`). No new LAN/WAN exposure.
- **Two layers:** a host-level access baseline (resolves TODO 7.2) and a per-service
access record (resolves TODO 3.2).
- **Baseline paths, every service:** host SSH, container exec + compose, logs
(Loki/Grafana, ADR-018), and the service admin API where one exists (`n/a` otherwise).
- **A new first-class sibling record** `ACCESS.md` (next to `SECURITY.md`/`VERIFY.md`),
**rendered from declarative data** — not hand-written prose (the firewall-catalog
philosophy of ADR-020 applied to access).
- **Active verification designed in:** a `/check-access` skill probes the declared paths
and reports which are live — the access analogue of `/verify-service` (ADR-017).
- **Direct LAN SSH from `ubongo` only** is added as a second, mesh-independent path
(amends ADR-016); all other LAN hosts stay blocked by default-deny.
## The doctrine
> **Every host and every service guarantees at least one documented, verifiable way in
> for operational troubleshooting — and the deploy that creates it also records and
> proves it.**
### Two layers
- **Host layer** (TODO 7.2). Every host, via the `base` role, guarantees a fixed access
baseline: SSH over `wt0` and from `ubongo` (below), Docker/Compose tooling present, and
log shipping live (Alloy → Loki; ADR-018). Little is *exposed*; a known, uniform set of
paths exists over the mesh. This is boma's answer to "what every host runs for access."
- **Service layer** (TODO 3.2). Every service role guarantees and records its paths:
container exec + compose management, its Loki log labels, and its admin API where one
exists (enabled, token in vault, endpoint + health probe documented) or explicit `n/a`.
### The three-tier access ladder
1. **`wt0` mesh SSH — primary.** WireGuard *cryptographically authenticates* the peer
before SSH sees it. The preferred path (ADR-016's original rationale).
2. **LAN SSH from `ubongo` — secondary, mesh-independent.** Most hardware (all but
`askari`) shares a LAN. SSH from `ubongo`'s LAN address is allowed via a new catalog
source, giving a fallback that survives a NetBird/`wt0` outage. It is gated by *source
IP* (spoofable on a LAN) **plus** the standing keys-only + fail2ban SSH hardening, so
the marginal cost is that the SSH daemon becomes reachable on the LAN from one trusted
source address, which is modest and deliberate. All *other* LAN hosts remain default-denied.
3. **Console — break-glass.** Mesh-*and*-LAN-independent, recorded per host class, not
used for routine work:
- **Cluster VMs** → Proxmox serial/VNC console (`qm terminal` / console via the
Proxmox host) — independent of the guest network, `wt0`, and even a broken guest
nftables ruleset.
- **`askari`** (bare-metal Hetzner) → provider rescue/console.
- **`ubongo`** (physical) → local console.
A total mesh outage therefore still leaves at least one documented way into each box.
## The declarative access data model (Approach B)
Structured access facts live as **data** — the single source of truth that both renders
`ACCESS.md` *and* tells `/check-access` what to probe, so doc and verifier cannot diverge.
### Service-layer — `access__*` in each service role's defaults
```yaml
access__service: photoprism
access__compose_project: photoprism # docker compose -p <this>
access__compose_path: /opt/photoprism/compose.yml
access__containers: [photoprism, photoprism-db] # exec targets
access__log:
  loki_labels: { service: photoprism } # how to query logs (ADR-018)
access__api:
  enabled: true
  base_url: "https://photoprism.host:2342" # reachable over the mesh
  firewall_ref: photoprism-api # the catalog entry that opens it (ADR-020)
  auth: { type: token, vault_ref: "vault.photoprism.api_token" }
  health_path: "/api/v1/status" # what /check-access pings
# where the service has no API:
# access__api: { enabled: false, reason: "<none upstream>" }
```
**Single-source-of-truth rule:** `access__api` **never opens a port**. It `firewall_ref`s
the entry in the `group_vars` firewall catalog — ADR-020 stays the sole owner of
*exposure*. The access data adds only *how to use* the path (endpoint, token ref, health
probe). No duplication, no ad-hoc ports (CLAUDE.md: ports only in the catalog).
### Host-layer — a fixed baseline, stated once
The host baseline (SSH on `wt0` + from `ubongo`, Docker/Compose present, Alloy live) is
uniform, so it is asserted by `base` and recorded once at the host/group level — not
re-stated per service. The break-glass console per host class is recorded with it.
## The rendered record — `ACCESS.md`
`ACCESS.md` is **rendered** from the `access__*` data, with a prose tail for the genuinely
narrative parts:
- **Access paths (generated)** — a table: each path (mesh SSH, LAN-SSH-from-`ubongo`,
exec/compose, logs, API), its tier (primary / secondary / break-glass), and the exact
invocation (`ssh host`, `docker compose -p <project> …`, the Loki query, the `curl`
against the API health path).
- **Break-glass (generated from host class)** — the Proxmox/provider console line.
- **Operational notes (prose)** — service quirks, gotchas, "if X is wedged, do Y." The
part a template cannot know.
A `docs/access/service-access-template.md` defines the shape, alongside the existing
security/verify templates.
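
As a rough illustration of "rendered from data", the generated table could be produced along these lines. This is a sketch under assumptions: the `render_access_rows` helper, the `docker-01` host name, and the two-column layout are hypothetical (the tier column and token handling are left out), and the dict mirrors the `access__*` example with the prefix dropped.

```python
def render_access_rows(access, host):
    """Render a Markdown table of access paths from declarative data."""
    project = access["compose_project"]
    rows = [
        ("mesh SSH", f"ssh {host}"),
        ("exec/compose", f"docker compose -p {project} ps"),
    ]
    # Build a LogQL stream selector from the declared Loki labels.
    labels = ",".join(
        f'{key}="{value}"' for key, value in access["log"]["loki_labels"].items()
    )
    rows.append(("logs", "{" + labels + "}"))
    api = access["api"]
    if api.get("enabled"):
        rows.append(("API", f"curl {api['base_url']}{api['health_path']}"))
    else:
        rows.append(("API", f"n/a ({api.get('reason', 'no reason given')})"))
    lines = ["| Path | Invocation |", "|---|---|"]
    lines += [f"| {path} | `{invocation}` |" for path, invocation in rows]
    return "\n".join(lines)


access = {
    "compose_project": "photoprism",
    "log": {"loki_labels": {"service": "photoprism"}},
    "api": {
        "enabled": True,
        "base_url": "https://photoprism.host:2342",
        "health_path": "/api/v1/status",
    },
}
table = render_access_rows(access, "docker-01")
print(table)
```

Because the same dict would feed `/check-access`, the rendered invocation and the probed one come from a single place.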
## The verifier — `/check-access` (designed now, build-pending on infra)
Runs from `ubongo`; turns the `access__*` data into live probes. Invoked
`/check-access <service>` (or `<host>` for the host baseline). The access analogue of
`/verify-service` (ADR-017).
| Path | Probe | Green = |
|---|---|---|
| `wt0` mesh SSH | connect over mesh, run `true` | reachable + key works |
| LAN SSH from `ubongo` | connect via LAN addr, run `true` | reachable + key works |
| exec + compose | `docker compose -p <project> ps`; exec `true` in each container | stack up, exec works |
| logs | query Loki for `loki_labels`, expect recent lines | logs flowing |
| admin API | `curl` the `health_path` with the vault token | 2xx |
| break-glass | reachability of the Proxmox/provider console endpoint only | console host reachable |
- **Break-glass is checked for reachability, not exercised** — firing a serial console is
invasive; the verifier confirms the fallback *exists* without disrupting anything.
- **Output:** a pass/fail table; on any red, it names the path and the likely cause
("API token in vault stale", "Alloy not shipping", "`ssh-from-control` catalog source
missing"). The payoff: not "the doc *says* you can get in" but "verified — three of four
paths green right now, here's the broken one."
- **Status:** designed now, build-pending on infra (needs live hosts + staging + vault),
exactly like `/verify-service` under ADR-017.
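
The reporting half of the verifier can be sketched independently of live infrastructure. In the example below the probes are stubs standing in for the real SSH, Loki, and API checks; `run_checks` and `summarize` are hypothetical names, and a probe that raises is simply treated as red.

```python
def run_checks(probes):
    """Run each probe (a zero-argument callable) and record pass/fail."""
    results = {}
    for path, probe in probes.items():
        try:
            results[path] = bool(probe())
        except Exception:
            # Any error (timeout, refused connection, ...) counts as red.
            results[path] = False
    return results


def summarize(results):
    """One-line verdict naming the broken paths."""
    red = [path for path, ok in results.items() if not ok]
    green_count = len(results) - len(red)
    broken = ", ".join(red) or "none"
    return f"{green_count} of {len(results)} paths green; broken: {broken}"


def stale_token_probe():
    # Stub simulating an API call that fails.
    raise TimeoutError("no answer from health endpoint")


probes = {
    "wt0 mesh SSH": lambda: True,
    "logs": lambda: True,
    "admin API": stale_token_probe,
}
summary = summarize(run_checks(probes))
print(summary)
```

Mapping each red path to a likely cause ("API token in vault stale", and so on) would sit on top of this and is not shown.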
## Governance — so it can't be forgotten
Three light touches mirror how `SECURITY.md`/`VERIFY.md` are enforced:
1. **Service checklist** (`docs/security/service-checklist.md`) gains one item: *"Access
paths declared (`access__*`), `ACCESS.md` rendered, `/check-access` green — or
deviation recorded in `accepted-risks.md`."*
2. **`new-role` runbook** (`docs/runbooks/new-role.md`) gains a step: fill `access__*`,
render `ACCESS.md`, run `/check-access`.
3. **`make new-role` scaffold** drops a stub `access__*` block + the `ACCESS.md` template
into the role — the same way roles already get `SECURITY.md`/`VERIFY.md` stubs, so it
is structurally impossible to ship a service role with no access record.
## Repo wiring
- **`docs/decisions/021-operational-access.md`** — the new ADR (doctrine, both layers,
the three-tier ladder, break-glass, the `access__*` model, `/check-access`).
- **`docs/decisions/016-mesh-vpn.md`** — amend: SSH on `wt0` **and** from `ubongo`'s LAN
address (was mesh-only). Cross-link ADR-021.
- **`docs/decisions/020-firewall.md`** — note the new `ssh-from-control` symbolic source.
- **`docs/access/service-access-template.md`** — the rendered `ACCESS.md` shape.
- **`docs/security/service-checklist.md`** — the one new gate item.
- **`docs/runbooks/new-role.md`** — the fill/render/`check-access` step.
- **`CLAUDE.md`** — `ACCESS.md` under "Role conventions"; ADR-021 in Further reading.
- **`STATUS.md`** — rows: ADR-021 doctrine *(designed)*; `ssh-from-control` catalog source
*(designed, builds with `base` firewall)*; `/check-access` *(designed, build-pending)*.
- **`docs/TODO.md`** — mark 3.2 and 7.2 DECIDED → ADR-021.
## What is buildable now vs later
- **Now:** the doctrine, ADR-021, the `ACCESS.md` template, the checklist/runbook/scaffold
wiring, and the `ssh-from-control` catalog source (the `firewall` concern of `base`
already exists, so the source can land with it).
- **Later (build-pending on infra):** `/check-access` *running*, and per-service
`ACCESS.md` *files* — both wait on service roles + live hosts. Designed now, built when
there is something to verify.
## Out of scope
- Building `base`'s non-firewall concerns, any service role, or live hosts.
- Broader LAN SSH (a management VLAN) — explicitly rejected; `ubongo`-only.
- Exercising (vs reachability-probing) the break-glass console.
- Any access path that is not over the mesh or the one `ubongo` LAN source.

@ -0,0 +1,164 @@
# Design — ADR structure & lifecycle
- **Date:** 2026-06-10
- **Status:** Approved design — implementation plan to follow
- **Resolves:** the absence of a written standard for how ADRs in
`docs/decisions/` are structured. The newest ADRs (019–022) have converged on a
clean pattern (`Status` → `Context` → `Decision` → `Consequences` → `Related`),
but it lives only as imitation; ADRs 001–018 predate it and most lack a `Status`
section.
- **Becomes:** ADR-023 (this design is the basis for that ADR).
- **Reuses:** boma's existing `*-template.md` convention (`service-security-template.md`,
`service-verify-template.md`, `service-access-template.md`, `service-backup-template.md`);
ADR-014 (knowledge-sourcing → the optional `Verified facts` section); ADR-019/020/021/022
(the emergent structure being codified); the `/review-repo` command (enforcement home).
---
## Problem
boma documents architectural decisions as numbered ADRs in `docs/decisions/`, and
CLAUDE.md treats them as load-bearing ("Before assuming a role, provider, or pipeline
exists, check STATUS.md"; the entire "Further reading" table points into them). Yet
there is no ADR that says how an ADR is written. The result:
- **Structural drift.** ADRs 001–018 are freeform; 019–022 converged on a consistent
shape but only by imitation. A new ADR's structure depends on which existing one the
author happened to copy.
- **No status discipline.** Most early ADRs have no `## Status` section, so there is no
uniform way to tell an active decision from a superseded or deprecated one — and no
written rule for how a decision gets reversed without silently rewriting history.
- **No scaffold.** Every other recurring document type in boma has a template
(`service-security-template.md`, etc.). ADRs do not.
This design codifies the structure 019–022 already demonstrate, pins a status
lifecycle, ships a template, and reconciles the back-catalogue.
## Scope
- **In:** the canonical section set (mandatory + optional); title and filename
convention; the `Proposed / Accepted / Superseded / Deprecated` status lifecycle and the
no-silent-rewrite rule; cross-reference convention; an ADR template file; a
lightweight `/review-repo` structure check; a **one-time retroactive restructure of
ADRs 001–018** to full conformance (all four mandatory sections + a parseable Status
line), reorganizing existing content under canonical headings.
- **Out (for now):** *changing the substance of* any existing decision (the restructure
is presentational — relabel/regroup/demote existing content, add a dated Status, never
alter what was decided); a `make lint` / CI gate for ADR structure (explicitly
rejected in favour of the `/review-repo` check — consistent with boma's other doctrine
ADRs, which add no CI gate); grandfathering pre-convention ADRs from the check
(rejected — the whole corpus is brought to conformance instead).
The lifecycle uses four states — `Proposed / Accepted / Superseded / Deprecated`. An
earlier draft of this design omitted `Proposed`, but ADR-011 (a real draft with open
questions) is evidence boma occasionally needs it, so it was kept.
## Decision
### 1. Title & filename
- Title line: `# ADR-NNN — <Title>: <optional clarifying subtitle>` (em-dash `—`,
matching every existing ADR).
- Filename: `NNN-kebab-title.md`, zero-padded 3-digit, monotonic, **never reused**
(a superseded ADR keeps its number and file).
- A new ADR is registered as a row in the CLAUDE.md "Further reading" table.
### 2. Canonical sections
**Mandatory — every ADR, in this order:**
| Section | Holds |
|---|---|
| `## Status` | `Accepted (YYYY-MM-DD)`, plus an optional one-line note (what it resolves/supersedes, or a doctrine-not-yet-built caveat as ADR-022 uses) |
| `## Context` | the forces, the problem, what exists today, why now |
| `## Decision` | what we are doing — numbered sub-decisions for multi-part ADRs, as 020/021/022 do |
| `## Consequences` | results, trade-offs *explicitly accepted*, follow-on work |
**Optional — use only where genuinely applicable, never as padding:**
- `## Related` — links to other ADRs by number.
- `## Scope` — explicit in/out-of-scope boundaries.
- `## Guardrails` / `## Enforcement` — how the decision is mechanically enforced
(lint, CI, hooks).
- `## What was ruled out` — rejected alternatives, each with its reason.
- `## Verified facts (ADR-014)` — version-stamped facts per the knowledge-sourcing rule.
### 3. Status lifecycle
Four states. Most ADRs are **born `Accepted (YYYY-MM-DD)`** — the sole author commits
to it on writing (boma is single-contributor and trunk-based with no review gate).
- **`Proposed (YYYY-MM-DD)`** — a genuine draft whose core direction is recorded but
whose specifics are still open (e.g. ADR-011, which carries open questions). Promoted
to `Accepted (YYYY-MM-DD)` once settled.
- **`Accepted (YYYY-MM-DD)`** — committed-to; the common starting state.
- Replaced by a later decision → the old ADR's Status becomes
**`Superseded by ADR-NNN (YYYY-MM-DD)`**; the superseding ADR records
`Supersedes ADR-MMM` in its own `## Status` and `## Related`. The link is
**bidirectional** — both files must point at each other.
- Retired with no replacement → **`Deprecated (YYYY-MM-DD)`** plus a one-line reason.
**Load-bearing rule — no silent rewrites.** An `Accepted` ADR is not edited to reverse
its decision. Typo and clarity fixes are fine; a *material reversal* requires a new ADR
and a `Superseded by` marker on the old one. The history of decisions stays legible.
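
The bidirectional-link rule is mechanically checkable. The sketch below assumes each ADR's Status text has already been extracted into a mapping and checks only the `## Status` side (not `## Related`); the ADR numbers and the `broken_supersede_links` name are hypothetical.

```python
import re

SUPERSEDED_BY = re.compile(r"Superseded by ADR-(\d{3})")
SUPERSEDES = re.compile(r"Supersedes ADR-(\d{3})")


def broken_supersede_links(statuses):
    """Return (old, new) pairs where the new ADR does not point back."""
    broken = []
    for number, text in statuses.items():
        for successor in SUPERSEDED_BY.findall(text):
            back_links = SUPERSEDES.findall(statuses.get(successor, ""))
            if number not in back_links:
                broken.append((number, successor))
    return broken


# Hypothetical Status texts keyed by ADR number.
statuses = {
    "007": "Superseded by ADR-024 (2026-07-01)",
    "024": "Accepted (2026-07-01). Supersedes ADR-007",
    "009": "Superseded by ADR-025 (2026-07-02)",
    "025": "Accepted (2026-07-02)",
}
broken = broken_supersede_links(statuses)
print(broken)
```

Here ADR-025 never records that it supersedes ADR-009, so that pair is reported.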
### 4. Cross-references
Reference other ADRs by number inline (`ADR-019`), and collect the relationships in a
`## Related` section.
### 5. Template file
Ship `docs/decisions/adr-template.md` — consistent with boma's existing
`*-template.md` convention. It contains the mandatory section headers pre-filled with
short HTML-comment hints, and the optional sections listed as commented stubs to
uncomment when relevant. It is a skeleton, not a numbered decision, so it does not take
an ADR number.
### 6. Retroactive restructure (001–018)
A **separate step** after the ADR and template land: bring every pre-convention ADR to
full conformance — all four mandatory sections present and a parseable Status line. This
is a **presentational** restructure, governed by a strict faithfulness rule:
- **Add** a `## Status` section valued `Accepted (YYYY-MM-DD)`, the date reconstructed
from the file's **first git-commit date**. For 016–018, whose existing trailing
build-state note is unparseable, prepend the dated `Accepted (...)` clause so the note
becomes a parseable Status line's tail.
- **Reorganize** existing content under the canonical headings: relabel a synonym
(`## Decisions` → `## Decision`), or introduce a `## Decision` umbrella and **demote**
the existing topical `##` headings to `###` beneath it. No sentence of existing prose
is altered.
- **Add** a `## Consequences` section built **only** from implications the ADR already
states (trade-offs, "what was ruled out", "open questions", follow-on work already
named). If an ADR genuinely states nothing that can be faithfully cast as a
consequence, that file is escalated for a human decision rather than inventing one.
- **Never** change the substance of a decision. A `git diff` of the restructure should
show heading-level changes, a new Status section, and a Consequences section assembled
from existing material — not edits to existing argument.
ADRs already conformant (019–022) are left alone. End state: the `adr-structure` check
reports zero findings across the whole corpus, with no grandfathering.
### 7. Enforcement
Lightweight, no CI gate. The `/review-repo` command gains an ADR-structure check:
every file in `docs/decisions/` matching `NNN-*.md` has the four mandatory sections and
a parseable `## Status` line. The template carries the convention forward for new ADRs.
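
One possible shape for the structure check, as a sketch only: it works on the text of a single ADR, and both the `adr_findings` name and the exact regular expression for a "parseable" Status line are assumptions derived from the lifecycle described above.

```python
import re

MANDATORY = ["## Status", "## Context", "## Decision", "## Consequences"]
STATUS_RE = re.compile(
    r"^(Proposed|Accepted|Deprecated) \(\d{4}-\d{2}-\d{2}\)"
    r"|^Superseded by ADR-\d{3} \(\d{4}-\d{2}-\d{2}\)"
)


def adr_findings(text):
    """Return a list of structure problems for one ADR (empty = conformant)."""
    lines = text.splitlines()
    headings = [line.strip() for line in lines if line.startswith("## ")]
    findings = [f"missing section: {h}" for h in MANDATORY if h not in headings]
    present = [h for h in headings if h in MANDATORY]
    if not findings and present != MANDATORY:
        findings.append("mandatory sections out of order")
    if "## Status" in headings:
        index = next(i for i, line in enumerate(lines) if line.strip() == "## Status")
        # The Status line is the first non-blank line under the heading.
        status = next((l.strip() for l in lines[index + 1:] if l.strip()), "")
        if not STATUS_RE.match(status):
            findings.append("unparseable Status line")
    return findings


good = (
    "## Status\nAccepted (2026-06-10)\n## Context\nwhy\n"
    "## Decision\nwhat\n## Consequences\nresults\n"
)
bad = "## Status\nactive\n## Context\nwhy\n## Decision\nwhat\n"
print(adr_findings(good))
print(adr_findings(bad))
```

`/review-repo` would run this over every `NNN-*.md` file in `docs/decisions/` and report the findings per file.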
## Consequences
- New ADRs have one obvious shape and a scaffold to start from; structural drift stops.
- Every ADR declares its lifecycle state uniformly, and reversals are traceable rather
than silent — the back-catalogue becomes a legible decision history.
- One-time churn: a restructure touching ~18 files (heading reorganization + a Status
section + a Consequences section per file). Larger and more judgment-heavy than a
Status-only backfill, hence the faithfulness rule and per-file review.
- The whole corpus conforms — the check needs no grandfathering or number threshold, and
stays simple (presence + parseable Status, applied uniformly).
- `/review-repo` grows a new check; no new CI machinery, matching boma's habit of not
gating doctrine in CI.
- This ADR is itself the first conformant example — it must follow its own structure.
## Open questions
None outstanding — title/filename, the **4-state lifecycle** (`Proposed / Accepted /
Superseded / Deprecated`; `Proposed` adopted on the evidence of ADR-011), template name
(`adr-template.md`), enforcement (`/review-repo`, no CI gate), and the **full
retroactive restructure** of 001–018 (no grandfathering) were all confirmed during
brainstorming and execution.

@ -0,0 +1,315 @@
# Design — Backup & disaster recovery strategy
- **Date:** 2026-06-10
- **Status:** Approved design — implementation plan written; Plan 1 (foundation) complete (see ADR-022)
- **Resolves:** `docs/TODO.md` item 3.8 ("ensure the right things are backed up,
incl. DB dumps") and `docs/CAPABILITIES.md` §9 (backup engine / off-site / air-gap,
all "planned")
- **Grounds:** the backup substrate that ADR-011 (update management) already leans on
("snapshot-before + backups remain the rollback mechanism", "always dumps the DB /
takes a backup first") but never defined
- **Reuses:** ADR-004 (one service = one role; per-service doc conventions),
ADR-008/017 (`VERIFY.md` per-service checks), ADR-021 (`ACCESS.md` rendered from
role `access__*` data — the same render-from-data pattern), ADR-015 (`ubongo`
recovery model; `mamba` break-glass clone)
- **Becomes:** ADR-022 (this design is the basis for that ADR)
---
## Problem
boma has no defined backup policy. The ADRs assume one exists — ADR-011 makes
"backup-first" the rule for stateful upgrades and "snapshot + backup" the rollback
path — but nothing specifies *what* gets backed up, *how* it stays consistent, *where*
copies live, *how* they're encrypted, or *whether restores actually work*.
`CAPABILITIES.md` §9 sketches an intent (PBS + restic, pCloud off-site, USB air-gap)
but commits to nothing.
This design defines the policy end-to-end: recovery model, what is captured and how,
the 3-2-1 topology, encryption and key escrow with a break-glass path, restore
testing, retention, failure alerting, and the air-gap mechanism.
## Scope
- **In:** application *state* backup for boma's hosts and services; off-site and
air-gapped copies; encryption + key escrow; restore testing; failure alerting;
retention; the backup node.
- **Out (for now):** whole-VM image backup (Proxmox Backup Server) — explicitly
deferred, see Decision 1; a central-vs-per-app database decision (TODO 3.9 — this
design is agnostic to it); Prometheus backup metrics (noted as a later add).
## Decisions (as settled)
### 1. Recovery model — data-only backups, rebuild from code (Model A)
boma's *configuration* is reproducible from this repo: Terraform recreates the VM,
Ansible re-renders the Docker Compose stack. So backups protect **state only** — DB
contents, bind-mount data dirs, Vaultwarden's vault — not whole-VM images.
To recover a host: Terraform re-provisions the VM → Ansible redeploys → restic
restores the data. **No Proxmox Backup Server.** This keeps 3-2-1 cheap, fits
pCloud's 1 TB comfortably, and turns every restore into a continuous proof that the
IaC *and* the backups both work.
Trade-off accepted: recovery is slower than a VM-image restore (a full Ansible run +
data restore, potentially hours), and it bets the repo is complete enough to rebuild
from nothing — which Tier-2 restore testing (Decision 8) exists to verify. **PBS
(Model B) or a per-host hybrid (Model C) can be added later** if real-world RTO proves
too slow; nothing here precludes it.
### 2. One backup tier, ~24 h RPO
A single tier: nightly backup of all state, accepting up to ~24 h of data loss across
the board. No per-data-type tiering yet — revisit once there is real-world data and
experience to justify the added machinery.
### 3. Engine — restic (data) + rclone (off-site); no PBS
- **restic** captures state into an encrypted, deduplicated repository.
- **rclone** replicates the repo to pCloud (pCloud has no good headless Linux client;
rclone has a first-class pCloud backend).
- restic encrypts the repo at rest, so rclone copies **ciphertext only** — no second
encryption layer, no pCloud "crypto folder."
### 4. Topology — central pull node (`fisi`), off the cluster
A single backup node owns the canonical restic repo. It is **off the Proxmox
cluster** — an independent failure domain, so copy 2 survives a PVE node (or the whole
cluster) dying. This mirrors the existing pattern for `ubongo` (control) and `askari`
(off-site): a manually-provisioned physical node in its own inventory group, still
Ansible-managed (base hardening + a `backup` role).
**Pull model.** The backup node holds SSH keys to each host; per service it runs the
declared dump command remotely, pulls the declared paths read-only, then `restic`
snapshots the staged data into its *local* repo. **Hosts hold no backup credentials
and cannot reach the repo** — so a compromised or ransomwared service host cannot
delete backup history.
**Backup node assignment:** `fisi` (an HP Elite 600 G9 tower), penciled in / provisional
— the *role* ("the backup node") is load-bearing; the physical assignment may be
revisited when all hardware is on hand. `fisi` holds **2× 8 TB HDDs in a mirror**
(ZFS or mdraid → 8 TB usable, survives one disk failure; not a stripe). It owns the
repo, runs the pull orchestration, runs `rclone → pCloud`, and **docks the USB
air-gap drives** (Decision 11). Pending one hardware item: the SATA power cable from
the board/PSU to the drives. A data-only restic node is a featherweight workload, so
the G9 is comfortably over-specced.
### 5. 3-2-1 mapping
| Copy | Location | Medium | Off-site | Notes |
|---|---|---|---|---|
| 1 | Live data on each host | NVMe/SSD | no | The working data |
| 2 | `fisi` restic repo | 8 TB HDD mirror | no (on-site, off-cluster) | Canonical repo |
| 3 | pCloud (via rclone) | Cloud | **yes** | Encrypted ciphertext; **sync-coupled** (see Decision 9 / threat model) |
| +4 | USB air-gap drive(s) | Removable HDD, **offline** | yes (stored off-site) | The **immutable backstop**; rotated |
≥3 copies, ≥2 media, ≥1 off-site — satisfied, with the air-gap drive as a fourth,
offline copy that no online compromise can reach.
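
The 3-2-1 arithmetic can be checked directly against the table. A small sketch, with the rows transcribed as data and `satisfies_321` as a hypothetical helper; media are compared as the labels used in the table.

```python
copies = [
    {"location": "live data on each host", "medium": "NVMe/SSD", "offsite": False},
    {"location": "fisi restic repo", "medium": "HDD mirror", "offsite": False},
    {"location": "pCloud", "medium": "cloud", "offsite": True},
    {"location": "USB air-gap drive", "medium": "removable HDD", "offsite": True},
]


def satisfies_321(copies):
    """At least 3 copies, on at least 2 media, with at least 1 off-site."""
    media = {copy["medium"] for copy in copies}
    offsite = [copy for copy in copies if copy["offsite"]]
    return len(copies) >= 3 and len(media) >= 2 and len(offsite) >= 1


print(satisfies_321(copies))
```

Dropping both off-site rows (pCloud and the USB drive) makes the check fail, which is the scenario the air-gap copy guards against.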
### 6. Per-service backup contract — `backup__*` data + `BACKUP.md` (hard convention)
Almost every boma service is the same shape: a Docker bind-mount data dir + maybe a
database. Each **service role declares its backup needs** in role vars — the same
render-from-data pattern boma uses for `access__*`/`ACCESS.md` (ADR-021):
```yaml
backup__service: nextcloud # identifier; matches the role / compose project
backup__state: true # false = stateless → no BACKUP.md (pair with a reason)
backup__paths: # bind-mount dirs / files holding state ([] = none)
  - /srv/nextcloud/data
backup__dumps: # logical app-consistent dumps (list; [] = none)
  - cmd: "docker compose exec -T db pg_dump -U {{ ... }} nextcloud"
    dest: nextcloud-db.sql
backup__quiesce: false # true = stop→back up→restart escape hatch
```
(ADR-022 is authoritative for the contract.)
The pull orchestrator reads these (rendered from inventory) and, per service: SSH in →
run the dumps → pull the dump files + declared paths read-only → `restic` snapshot. A
service with **no** `backup__paths` is explicitly "nothing to back up" (declared, not
silent).
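
The per-service flow can be sketched as a planning step that turns `backup__*` data into an ordered list of actions. This is illustrative only: `plan_backup` is a hypothetical name, the dump command is a placeholder, treating "no paths and no dumps" as nothing to back up is an assumption, and so is the point at which a quiesced service is restarted.

```python
def plan_backup(svc):
    """Turn one service's backup__* data into ordered, human-readable steps."""
    name = svc["backup__service"]
    paths = svc.get("backup__paths", [])
    dumps = svc.get("backup__dumps", [])
    if not svc.get("backup__state", True) or not (paths or dumps):
        return [f"{name}: nothing to back up (declared)"]
    quiesce = svc.get("backup__quiesce", False)
    steps = []
    if quiesce:
        steps.append("stop containers")
    steps += [f"run dump -> {dump['dest']}" for dump in dumps]
    steps += [f"pull {path} read-only" for path in paths]
    if quiesce:
        steps.append("restart containers")
    steps.append("restic snapshot of staged data")
    return [f"{name}: {step}" for step in steps]


nextcloud = {
    "backup__service": "nextcloud",
    "backup__state": True,
    "backup__paths": ["/srv/nextcloud/data"],
    "backup__dumps": [{"cmd": "<declared dump command>", "dest": "nextcloud-db.sql"}],
    "backup__quiesce": False,
}
plan = plan_backup(nextcloud)
for step in plan:
    print(step)
```

A stateless service yields a single explicit "nothing to back up" line rather than an empty plan, matching the "declared, not silent" rule.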
**`BACKUP.md` becomes a required per-service doc** alongside `SECURITY.md` /
`VERIFY.md` / `ACCESS.md`, **rendered from the role's `backup__*` data**, documenting:
what state exists, what is backed up, the dump command, and the per-service **restore**
procedure. A template lives at `docs/backup/service-backup-template.md`. `make lint`
gates its presence for service roles.
### 7. Consistency — logical dumps first, quiesce as an escape hatch
- **Default (A):** databases are captured with logical dumps (`pg_dump` /
`mysqldump`) — portable, version-independent, restorable to a fresh DB. Plain data
dirs are backed up as files. No downtime. Cost: every stateful service must declare
a working dump command, *tested by restore drills*.
- **Escape hatch (B):** a service whose data cannot be dumped live declares a
quiesce step (stop container → back up volume → restart) in the same contract.
- ZFS/filesystem snapshots are **not** used as the sole DB method (only
crash-consistent for a live database).
This is agnostic to the open central-vs-per-app database question (TODO 3.9): either
way, each service declares how to dump its own data.
### 8. Restore testing — two tiers
- **Tier 1 — frequent, automated, rolling restore-verify (weekly).** Pick the next
service in rotation, restore its latest snapshot into a throwaway **container on
`ubongo`** (reusing boma's existing Molecule harness, ADR-015), start the app
against the restored data, and **run that service's `VERIFY.md` checks**
(ADR-008/017) against it, then tear down. This catches the failure that actually
kills people — *silently corrupt or unrestorable backups*. Failures alert via ntfy.
- **Tier 2 — rare, full DR rehearsal (semi-annual), driven from `ubongo` onto PVE
staging.** Rebuild a host from zero via Terraform + Ansible + restic restore on the
staging cluster (only a real PVE node can host the VM; `ubongo` orchestrates). This
validates the whole Model-A recovery chain, not just "can I read a snapshot."
**At least once a year the rehearsal exercises the paper-secret break-glass path**
(Decision 10) end-to-end.
`ubongo` stays **bare Debian, not a hypervisor** (ADR-015 unchanged): its job is to be
the independent recovery anchor — "the tool used to rebuild the cluster must not live
inside the thing it rebuilds." Higher-fidelity real-VM testing is *better* served by
the PVE staging env (same hardware class, same cluster, same provisioning path) than
by converting `ubongo`. `ubongo`'s real spec is a ThinkCentre M70q (i3-10100T / 16 GB
/ **1 TB NVMe**) — the 1 TB gives ample room for Tier-1 dataset restores; disk
headroom (not CPU/RAM) is the first thing to watch as data grows (`/capacity-review`).
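The Tier-1 "next service in rotation" choice can be sketched as a small helper. This assumes the rotation is an ordered list of service names and that the name of the last verified service is persisted somewhere (for example a state file on `ubongo`); both are assumptions, and the service names in the usage note are placeholders.

```sh
#!/bin/sh
# Sketch only: next_in_rotation is a hypothetical helper, not an existing script.

# next_in_rotation LAST SERVICE...
# Prints the service that follows LAST in the list, wrapping to the first.
# If LAST is empty or no longer in the list, the rotation restarts at the first.
next_in_rotation() {
    last=$1
    shift
    [ "$#" -gt 0 ] || return 1
    first=$1
    take_next=0
    for svc in "$@"; do
        if [ "$take_next" = "1" ]; then
            echo "$svc"
            return 0
        fi
        if [ "$svc" = "$last" ]; then
            take_next=1
        fi
    done
    echo "$first"
}
```

With the list `gitea vaultwarden nextcloud` and `vaultwarden` verified last week, the helper selects `nextcloud`; the week after, it wraps back to `gitea`.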
### 9. Retention — GFS via restic
Starting policy: `--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1`.
`restic forget --prune` runs nightly on `fisi`'s repo; pCloud mirrors the pruned repo.
Tune once real repo growth is observed.
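One way to keep the starting policy tunable is to hold the flags in a single variable and build the command from it. The sketch below only prints the command; the repo path in the usage note and the `prune_cmd` name are assumptions.

```sh
#!/bin/sh
# Sketch only: prune_cmd is a hypothetical helper; the flags are the starting policy above.
RETENTION_FLAGS="--keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1"

# prune_cmd REPO [EXTRA...]: print the restic invocation instead of running it.
prune_cmd() {
    repo=$1
    shift
    # RETENTION_FLAGS is deliberately unquoted so it splits into separate arguments.
    set -- restic -r "$repo" forget --prune $RETENTION_FLAGS "$@"
    echo "$*"
}
```

`prune_cmd /srv/restic --dry-run` prints a command that previews which snapshots would be removed, which is a cheap check to run before the first real prune and after any policy change.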
### 10. Encryption + key escrow + break-glass
restic already encrypts the repo, so **one secret — the restic repo password —
protects all copies uniformly** (fisi, pCloud, USB). One thing to escrow, not three.
**Escrow locations:**
- **`fisi`, root-only** (+ in the Ansible vault) — so backups run non-interactively
and `fisi` is redeployable.
- **Vaultwarden** — the day-to-day human-accessible copy.
- **Paper, in a physical safe (off-site)** — the break-glass root of trust; the only
copy that survives "everything is down."
**Model-A twist — the paper holds *two* secrets, not one:**
1. the **restic repo password** (to read any backup at all), and
2. the **Ansible vault master password** (to rebuild hosts from the repo — normally
from Vaultwarden via `rbw`, which is itself down in a from-zero recovery).
With both on paper, the break-glass chain has **no circular dependency**: paper →
restic restores Vaultwarden + repo data → the vault password (from paper) drives
Terraform/Ansible re-provisioning → services return, `rbw` works again. `ubongo`'s
ADR-015 recovery model already establishes **`mamba` (laptop) as a break-glass clone**
(repo + toolchain + mesh + `rbw`, with Terraform state synced to it) — the rebuild can
be driven from `mamba` if `ubongo` is also gone. The printed sheet is a short
**break-glass runbook** assuming zero running boma infrastructure: install restic on
any machine, point it at pCloud *or* a USB drive with the password, restore Vaultwarden
first, then rebuild with the vault password.
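The first restore step on the printed sheet might look like the following sketch. It assumes snapshots can be filtered by the hostname of the machine that ran Vaultwarden; the host name, repo locations, and the `break_glass_plan` helper are hypothetical. With no password supplied through the environment, restic prompts for it interactively, which suits a password read off paper.

```sh
#!/bin/sh
# Sketch only: prints the commands for review; all names in the examples are placeholders.

# break_glass_plan REPO HOST TARGET
# REPO is any restic repository location: a mounted USB path or an rclone remote.
break_glass_plan() {
    repo=$1; host=$2; target=$3
    echo "restic -r $repo snapshots --host $host"
    echo "restic -r $repo restore latest --host $host --target $target"
}
```

`break_glass_plan /mnt/usb/restic vault-host /restore` covers the USB path. For pCloud the repo argument would take the form `rclone:<remote>:<path>`, which additionally requires a configured rclone remote on the recovery machine.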
### 11. USB air-gap trigger (plug-and-go cold copy)
A **udev rule on `fisi` matching an allowlist of known drive serials** triggers a
systemd unit → script that: mounts the drive, confirms it is an expected drive, runs
**`restic copy` from the local repo → a restic repo on the USB drive** (dedup-aware,
same password → ciphertext if lost/stolen), runs `restic check` on the USB copy,
unmounts, and **notifies via ntfy** with the result. Only allowlisted serials trigger
anything (a rogue USB does nothing).
`restic copy` (not rsync) so the USB is itself a valid restic repo — restorable
**directly** in a break-glass with nothing else alive. Rotate among a few drives,
**stored off-site** → also a second *geographic* off-site copy independent of pCloud.
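The allowlist gate and the copy step could be sketched as below. It assumes the udev rule passes the drive's serial (for example the `ID_SERIAL_SHORT` property) to the script, that the allowlist is a plain file with one serial per line, and that both repo passwords are supplied via `RESTIC_PASSWORD_FILE` and `RESTIC_FROM_PASSWORD_FILE`. The allowlist path is hypothetical, and mounting, unmounting, and the ntfy notification are left out.

```sh
#!/bin/sh
# Sketch only: paths and the allowlist format are assumptions.
ALLOWLIST=${ALLOWLIST:-/etc/backup/usb-serials.allow}   # hypothetical location

# serial_allowed SERIAL ALLOWLIST_FILE
# Succeeds only if SERIAL equals a whole non-comment line of the allowlist.
serial_allowed() {
    [ -n "$1" ] || return 1
    grep -v '^[[:space:]]*#' "$2" | grep -Fxq -- "$1"
}

# airgap_copy SERIAL SRC_REPO USB_REPO
# Unknown drives are ignored; known drives receive a copy followed by an integrity check.
airgap_copy() {
    if ! serial_allowed "$1" "$ALLOWLIST"; then
        echo "ignoring unknown drive serial: $1" >&2
        return 0
    fi
    restic -r "$3" copy --from-repo "$2" && restic -r "$3" check
}
```

`--from-repo` needs restic 0.14 or newer. For deduplication to carry over between the two repos, the USB repo would likely need to be initialised once with `restic init --from-repo <src> --copy-chunker-params`.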
### 12. Failure alerting — guard against silent death
Success/failure pings alone miss the worst case (*the job silently stopped running*):
- **Dead-man's-switch:** every successful nightly run pings an **Uptime Kuma push
monitor** (already in the planned stack); no ping in ~25 h → alert.
- **Immediate failure → ntfy** on any job or dump-step error.
- **Periodic `restic check`** (weekly) for repo integrity → alert on corruption.
- **Tier-1 restore-verify failures → ntfy.**
- *(Later)* emit last-success timestamp + repo size as Prometheus metrics for a
Grafana panel (fits ADR-018's monitoring direction; not required for v1).
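The success-ping and failure-alert pair can be sketched as a wrapper around any job. The URLs are placeholders, and `guarded` is a hypothetical name; the only external behaviour relied on is Uptime Kuma's push endpoint accepting `status` and `msg` query parameters and ntfy accepting a POST body with `Title` and `Priority` headers. The ~25 h window itself would be configured on the push monitor, not in the script.

```sh
#!/bin/sh
# Sketch only: URLs are placeholders; real values belong in root-only config on fisi.
: "${KUMA_PUSH_URL:=https://kuma.example/api/push/TOKEN}"
: "${NTFY_URL:=https://ntfy.example/backups}"
: "${HTTP:=curl -fsS -m 10}"

# guarded NAME CMD...
# Runs CMD. On success, pings the push monitor (the dead-man's switch stays quiet).
# On failure, sends an immediate ntfy alert and returns the job's exit code.
guarded() {
    name=$1
    shift
    if "$@"; then
        $HTTP "$KUMA_PUSH_URL?status=up&msg=$name" >/dev/null
    else
        rc=$?
        $HTTP -H "Title: backup failed: $name" -H "Priority: high" \
            -d "exit code $rc" "$NTFY_URL" >/dev/null
        return "$rc"
    fi
}
```

Job names should stay URL-safe, since `msg` is passed unencoded. If the job never starts at all, neither branch runs; that is the case the missing heartbeat is meant to catch.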
### 13. Schedule
- **Nightly backup run (~02:00–04:00),** driven by `fisi` (pull): per host →
run dumps → pull paths read-only → `restic` snapshot → `restic forget --prune`
(Decision 9) → `rclone sync` → pCloud. Sequential, off-hours.
- **Tier-1 restore-verify:** weekly, rolling one service, on `ubongo`.
- **Tier-2 DR rehearsal:** semi-annual on staging; ≥1/year exercises the paper path.
- **USB air-gap:** manual, ~monthly, whenever a drive is docked.
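The ordering of the nightly run can be sketched as a sequential driver. The `step_*` functions are placeholders that only print what they stand for, and stopping at the first failure (so that a broken snapshot is never followed by a prune or a sync) is an assumption of this sketch, not something the schedule prescribes.

```sh
#!/bin/sh
# Sketch only: each step_* placeholder would wrap the real command on fisi.
step_dump()     { echo "dump $1"; }       # ssh to the host, run its declared dumps
step_pull()     { echo "pull $1"; }       # pull dumps + backup__paths read-only
step_snapshot() { echo "snapshot $1"; }   # restic backup into the local repo
step_prune()    { echo "prune"; }         # restic forget --prune (Decision 9)
step_sync()     { echo "sync"; }          # rclone sync of the repo to pCloud

# nightly_run HOST...
# Hosts are processed one after another; prune and sync run once at the end.
nightly_run() {
    for host in "$@"; do
        step_dump "$host"     || return 1
        step_pull "$host"     || return 1
        step_snapshot "$host" || return 1
    done
    step_prune || return 1
    step_sync
}
```

A gentler variant would record a failed host, continue with the rest, and skip only the prune; which behaviour is preferable depends on how often single-host failures turn out to occur in practice.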
## Architecture & data flow (nightly run)
```
                       ┌──────────────────────────────────────────┐
docker_hosts / etc.    │ fisi (backup node)                       │
┌───────────┐   SSH    │ pull orchestrator (reads backup__* )     │
│ service A │◀─────────│ 1. ssh host → run dumps (pg_dump…)       │
│ + DB      │ pull RO  │ 2. pull dump + backup__paths (read-only) │
└───────────┘─────────▶│ 3. restic snapshot → local repo (mirror) │
┌───────────┐          │ 4. restic forget --prune (GFS)           │
│ service B │          │ 5. rclone sync repo → pCloud (offsite)   │
└───────────┘          │ 6. heartbeat → Uptime Kuma; errors→ntfy  │
                       └───────────────┬──────────────────────────┘
                                       │ (manual, ~monthly)
                              udev: known drive plugged
                              restic copy → USB repo (air-gap, offline)
```
Restore (Model A): Terraform re-provisions the VM → Ansible redeploys the role →
restic restores `backup__paths` + replays the dump → `VERIFY.md` confirms.
## Components & boundaries
- **`backup` role (on `fisi`):** pull orchestrator, restic repo management, retention
prune, rclone→pCloud sync, udev/air-gap unit, alerting hooks. New inventory group
(e.g. `backup_hosts`) with the `base` role applied, like `control`/`offsite_hosts`.
- **Per-service backup contract:** `backup__*` role vars + rendered `BACKUP.md`
(Decision 6); a hard convention enforced by `make lint`.
- **`ubongo`:** schedules/drives Tier-1 (local container) and Tier-2 (onto staging);
unchanged role per ADR-015.
- **Secrets:** restic password + rclone token in `fisi` (root-only) and the Ansible
vault; escrowed per Decision 10.
## Threat model / 3-2-1 honesty
- **`rclone sync` propagates deletions** — a prune, or a *malicious* wipe of `fisi`'s
repo, replicates to pCloud. pCloud is therefore the **off-site** copy but **not
immutable**. Mitigations: the **USB air-gap drive is the immutable backstop**
(offline = unreachable by any online compromise) and **pCloud's own file-version
history** is enabled as a recovery cushion.
- **Pull model** stops a compromised *service host* from touching the repo.
- **`fisi` is the crown-jewel host** — it holds an encrypted copy of all state, so it
gets full base hardening and tight access. restic encryption means a stolen `fisi`
(or USB, or pCloud blob) yields ciphertext only.
- **pCloud's 1 TB is the smallest copy → the off-site capacity ceiling.** Data-only
backups fit for years at homelab scale; flag for `/capacity-review` if the repo
trends toward ~1 TB.
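A capacity flag for the off-site ceiling could be as small as the sketch below. The 80 % threshold is an arbitrary illustrative value, the helper names are hypothetical, and the used-bytes figure is assumed to come from something like `du -sb` on the local repo.

```sh
#!/bin/sh
# Sketch only: threshold and helper names are assumptions.

# repo_usage_pct USED_BYTES LIMIT_BYTES: integer percentage of the ceiling in use.
repo_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# needs_capacity_review USED_BYTES LIMIT_BYTES [THRESHOLD_PCT]
# Succeeds when usage has reached the threshold (default 80).
needs_capacity_review() {
    [ "$(repo_usage_pct "$1" "$2")" -ge "${3:-80}" ]
}
```

With a 1 TB limit (1000000000000 bytes), a repo of 800 GB reports 80 and would be flagged for `/capacity-review`, while a 300 GB repo would not.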
## What this changes in the repo (for the plan)
- New `backup` role + `backup_hosts` inventory group; `fisi` hardware-reference entry.
- New per-service convention: `backup__*` vars + `BACKUP.md` (template at
`docs/backup/service-backup-template.md`); `make lint` gate; update role-conventions
in `CLAUDE.md` and the new-role scaffolding/runbook.
- Update `docs/hardware/reference.md`: `ubongo` = M70q (i3-10100T/16 GB/**1 TB**);
add `fisi`.
- Update `CAPABILITIES.md` §9 (PBS → deferred; restic+rclone+USB the committed engine).
- Close `docs/TODO.md` 3.8; cross-reference from ADR-011.
- The break-glass runbook (printed sheet + `docs/runbooks/`), referencing ADR-015's
`mamba` clone and Terraform-state survival.
## Non-goals / YAGNI
- No PBS / whole-VM images in v1 (Decision 1).
- No per-data-type RPO tiering in v1 (Decision 2).
- No second encryption layer over restic (Decision 3).
- No central NAS/file-share scope creep on `fisi` — it stays single-purpose.
## Open / deferred
- Central vs per-app database (TODO 3.9) — orthogonal; this design works either way.
- Prometheus backup metrics — a later addition (Decision 12).
- PBS (Model B) or hybrid (Model C) — revisit if real-world RTO is too slow.
