boma/docs/FRICTION.md
sjat 180af46879 docs(friction): log the Molecule input_only-accept coverage gap
Final-review finding: the default Molecule scenario only renders the forward
drop (input_only off) branch; the accept branch is covered by the integration
harness only. Tracked for a kaizen decision (2nd scenario vs accept the split).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:40:29 +02:00


FRICTION.md — kaizen friction log

Raw signals for the periodic kaizen review (/kaizen; see docs/TODO.md 11). This is the input that keeps our tooling and conventions sharpening over time instead of only accreting.

How to use: append freely during work under Open signals — don't curate, don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling that isn't earning its keep. /kaizen reads this, then proposes a verdict per signal (SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased toward remove/park for unused tooling), migrates durable knowledge into the right docs, and moves consumed signals into the decisions ledger below.

Entry format: date — [tag] observation — (optional) → systematization idea

Tags: [friction] recurring annoyance · [gotcha] surprising behaviour · [recurring] keeps coming back, should be systematized · [unused] tooling not earning its keep.
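
To make the entry shape concrete, here is a minimal Python sketch that pulls the tag, the date, and the trailing systematization idea out of one signal line. It is an illustration only and not the logic of scripts/friction-scan.py; the Signal type, parse_signal, and the heuristic that the last spaced arrow starts the idea are all assumptions introduced here.

```python
import re
from typing import NamedTuple, Optional

# The four tags listed in the entry-format note.
TAG_RE = re.compile(r"\[(friction|gotcha|recurring|unused)\]")
# ISO dates as used by the signals in this log.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


class Signal(NamedTuple):
    """Hypothetical record for one parsed signal line."""
    tag: Optional[str]
    date: Optional[str]
    observation: str
    idea: Optional[str]


def parse_signal(line: str) -> Signal:
    tag = TAG_RE.search(line)
    date = DATE_RE.search(line)
    # Heuristic (assumption): the LAST " → " separates the observation from
    # the idea, because observations may contain arrows of their own.
    head, arrow, tail = line.rpartition(" → ")
    if not arrow:
        head, tail = line, ""
    return Signal(
        tag=tag.group(1) if tag else None,
        date=date.group(0) if date else None,
        observation=head.strip(),
        idea=tail.strip() or None,
    )
```

A line without an arrow simply yields idea=None, which matches the "(optional)" marker in the format.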


Open signals

(append new raw signals here; the next kaizen review consumes them)

  • [gotcha] base's nftables forward policy drop breaks Docker hosts on reboot (2026-06-17): base/templates/nftables.conf.j2 sets chain forward { ... policy drop; }. On a Docker host, container traffic is forwarded (published-port DNAT → container, and inter-container over the bridge), so the drop kills it. It worked right after make deploy (Docker's runtime rules coexisted) but after a reboot nftables loaded our default-deny before Docker, breaking WAN→Caddy and Caddy→coordinator → the public services and the mesh went down. The docker_host "nftables.d container-forward rules" that would make this Docker-safe are explicitly pending (STATUS.md). → the base firewall (base__firewall_apply) must NOT be applied to any Docker host until docker_host ships the container-forward rules; add a guard/check (a Docker host with firewall_apply: true and no container-forward drop-in is a misconfiguration), and the firewall design (ADR-020) should state the Docker-host dependency explicitly.

  • [gotcha] ip_nonlocal_bind did NOT beat the sshd boot-race (2026-06-17): the mesh-hardening plan bound sshd ListenAddress to the wt0 IP and set net.ipv4.ip_nonlocal_bind=1 so sshd could bind the mesh IP before wt0 exists at boot. In practice the console still showed sshd "could not assign the address" at boot — so the protection did not work as designed, and because wt0 never came up (the coordinator was down), sshd had no listener at all → no SSH path. → the entire "sshd listens on wt0 only" premise is unsound without (a) a verified boot-race fix and (b) a guaranteed non-mesh break-glass. Re-investigate why ip_nonlocal_bind didn't help (ordering vs the sysctl drop-in load? the sysctl not applied before sshd start?), or drop ListenAddress-on-mesh entirely and rely on the host firewall for SSH scoping.

  • [gotcha] The coordinator host can't bootstrap the mesh it depends on (2026-06-17): askari runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird agent needs the coordinator (a local container) to be serving to bring up wt0 — but the coordinator wasn't healthy, so wt0 never came up. Circular. Combined with sshd being wt0-only, the host was reachable only via the Hetzner console. → the coordinator host must keep a non-mesh management path always (don't move its SSH onto wt0), or the mesh-hardening must treat the coordinator host as a special case. General rule: never make a host's only management path depend on a service that host itself hosts.

  • [gotcha] NetBird netbird-server FATAL-loops on the geolocation DB download with no egress (2026-06-17): on startup the combined netbird-server:0.72.4 tries to download the GeoLite2 DB from pkgs.netbird.io and treats failure as FATAL (crash-loop) — so any loss of container egress (here: Docker NAT masquerade wiped when nftables was flushed, not re-added by a plain restart docker) takes the whole control plane down. Recovery was restart docker (rebuild NAT) → force-recreate the container so it could download. → for the netbird_coordinator role: pre-seed/persist the geo DB in the data dir (or pin a local copy), or disable the geolocation requirement, so a transient egress blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT) is fragile across nft flush + reboot.

  • [friction] No off-site coordinator backup turned a 2-minute restore into a long live recovery (2026-06-17): the NetBird coordinator's stateful store (/var/lib/netbird, encrypted SQLite) has no off-site backup yet (ADR-022 backup role pending, flagged in STATUS as the coordinator's deferred backup). During the incident there was a real fear the unclean reboots had corrupted the store, with no restore path. It turned out to be a runtime/egress issue, not corruption — but the absence of a backup made the whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the netbird_coordinator store ahead of the rest of the backup role; a recent off-host copy would have made "rebuild askari from scratch" a safe option.

  • [friction] The plan tested reboot-recovery AFTER removing the break-glass (2026-06-17): the mesh-hardening plan's live cutover closed the WAN :22 (step 5) before the reboot-resilience test (step 7), so the one fallback path was gone exactly when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for lockout-risky cutovers: validate reboot-recovery while the old access path is still open, and only retire the break-glass once recovery (incl. a reboot) is proven. Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.

  • [gotcha] Debian 13 genericcloud boot-loops under legacy BIOS/SeaBIOS (2026-06-18): virt-install --import of the genericcloud qcow2 with the default (SeaBIOS) firmware triple-faults at the real-mode kernel handoff — GRUB loops, no "Decompressing Linux", no DHCP lease. The symptom (no network) pointed away from the cause (firmware). → boot test VMs via UEFI (virt-install --boot uefi; OVMF→efistub).

  • [friction] The no-sudo claude model blocked diagnosing a failed VM (2026-06-18): under ADR-015 claude had no sudo, so when the VM wouldn't network there was no way to introspect it (serial logs are root:0600, libguestfs not installed, mounting needs root). Diagnosis was fully blocked until the operator granted claude sudo. → DECISION: claude gets NOPASSWD:ALL (reverses ADR-015's "no local sudo"); compensating control is auditd/Loki attribution (already in ADR-015). Amend ADR-015/ADR-021 + accepted-risks; codify the sudoers drop-in in Ansible.

  • [gotcha] Non-root virsh/virt-install default to qemu:///session (2026-06-18): the substrate (NAT net, /dev/kvm) lives on qemu:///system. → pin LIBVIRT_DEFAULT_URI=qemu:///system in the driver.

  • [gotcha] qemu:///system (libvirt-qemu) can't traverse /home (2026-06-18): VM disk/seed/console under the repo/home failed "Permission denied (search permissions for /home/claude)". → put per-VM artifacts in a system-readable dir (/var/lib/boma-integration, group libvirt); the inventory (read by ansible as the user) can stay in the repo.

  • [gotcha] ansible-playbook -i <dir>/ parses sibling non-inventory files as INI (2026-06-18): pointing -i at a run-dir holding a state file + qcow2s made the directory inventory loader parse the state file as INI → phantom hosts INCLUDING the real askari (with its real vars), breaking the single-host isolation invariant. → point -i at the single hosts.yml. Caught by the holistic cross-file review BEFORE any hardware run.

  • [gotcha] Jinja {%- -%} + ansible trim_blocks=True double-strip newlines (2026-06-18): a template edit used {%- -%}, reviewed by rendering with RAW jinja2 (trim_blocks=False) which looked fine; ansible (trim_blocks=True) then collapsed the rendered Caddyfile onto single lines → caddy crash-looped on invalid config. → verify templates with ansible's whitespace (trim_blocks=True), not raw jinja2; prefer plain {% %} at column 0 (the repo's existing style).

  • [gotcha] Fresh cloud images have empty apt lists (2026-06-18): apt install nftables failed "No package matching 'nftables' is available" on a fresh genericcloud VM whose cloud-init had package_update: false. → package_update: true AND block on cloud-init status --wait before applying.

  • [gotcha] base's default-deny firewall drops SSH to a NAT'd VM unless the gateway is allowed (2026-06-18): the driver reaches the VM via the libvirt-NAT gateway (192.168.150.1). ct established,related accept saves the in-flight apply connection, but a fresh post-reboot SSH is dropped without an explicit allow. → test overlay sets base__firewall_control_addr to the NAT gateway.

  • [recurring] Real-hardware shakedown and static review each caught what the other couldn't (2026-06-18): the qemu-URI, storage-path, UEFI, apt-list, and caddy-render bugs ALL surfaced only on a live KVM run; the phantom-host inventory bug surfaced only in the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial cross-file review AND a real-hardware run; neither alone would have shipped it working.

  • [friction] Raw DHCP leases pinned in ubongo's host firewall (admin-addr SSH allows) (2026-06-19): mesh-hardening 2/3 lets the operator workstations reach ubongo's LAN SSH by raw lease (base__firewall_admin_addrs: ["10.20.10.50" (mamba), "10.20.10.17"]) because there is no DHCP reservation yet (OPNsense isn't managed as code). A lease reassignment silently moves the allow to whatever host next holds the IP (still SSH-key-gated) and drops the workstation's LAN path (mesh still works, so never a full lockout). → when OPNsense-as-code lands (ADR-020 perimeter / TODO 3.5), replace both with MAC-pinned DHCP reservations (10.20.10.17 = MAC bc:0f:f3:c8:4a:8a; mamba's MAC TBD) and allow the reserved IPs. Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md.

  • [gotcha] make test-integration on ubongo fails (qemu-img "Permission denied") when the agent session predates the libvirt group grant (2026-06-19): the integration_test role adds claude to libvirt+kvm and makes the cache dir /var/lib/boma-integration root:libvirt 2775 — correct — but a claude session whose shell started before that grant carries a stale process group set (id shows claude,docker only, no libvirt), so qemu-img create of the VM overlay into the group-owned dir is denied. virsh/virt-install still work (they reach system libvirtd via polkit/socket, and the real KVM runs server-side as libvirt-qemu), so ONLY claude's own file-writes break. Unblock without restarting the session: sg libvirt -c 'make test-integration HOST=<name>' (claude needs only libvirt for the dir; kvm is server-side; note sg adds one group, not the full set). → self-heal in scripts/integration-vm.py: if the libvirt gid is absent from os.getgroups(), re-exec under sg libvirt (or have the Makefile target do it), so a stale-session agent never hits this opaque symptom. New agent sessions pick the groups up on login, so it's a stale-session transient — but high-confusion, worth self-healing.

  • [friction] No standard for when the agent may run local-VM integration tests on ubongo without asking (2026-06-19): make test-integration HOST=<name> spins an ISOLATED throwaway KVM VM (its own libvirt NAT; never touches the real host's firewall/network; guards: one-VM-at-a-time + a 4 GiB free-RAM floor + auto-destroy on success), so it is safe and self-contained — yet the agent paused for a go-ahead before running it (mesh-hardening 2/3, Task 4). The operator wants a STANDARD that pre-authorises VM-testing on ubongo so the agent just runs it. → decide + record the rule: e.g. a .claude/settings.json permission allow for make test-integration* / scripts/integration-vm.py (and the sg libvirt -c '…' form per the gotcha above), plus a CLAUDE.md line distinguishing the pre-authorised isolated VM tests from the genuinely-gated live steps (make deploy to real hosts, host reboots, cutovers — still need a go-ahead). Ties to the test-risky-infra-before-live-deploy + dont-reask-settled-defaults memories + ADR-025.

  • [gotcha] Molecule covers only the input_only-OFF (forward drop) branch of the base firewall (2026-06-19): mesh-hardening 2/3 added base__firewall_input_only (forward policy drop↔accept). The default Molecule scenario renders ONE fixture, set to the secure default (drop) — so the fast make test ROLE=base gate locks the drop default (security-critical for service hosts) but does NOT exercise the =true → forward-accept rendering; only make test-integration HOST=ubongo does (passed GREEN). An in-converge re-render can't cheaply cover it (role defaults aren't in scope outside the role run). → decide in kaizen: a second Molecule scenario (molecule/input-only/) asserting forward policy accept, vs accepting the integration-only coverage. Final-review finding; not a cutover blocker (the accept branch is a literal, and a var-name break would fail the drop branch too → caught).
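
The self-heal proposed in the stale-session gotcha above (re-exec under sg libvirt when the libvirt gid is missing from os.getgroups()) could look roughly like the sketch below. It is not the contents of scripts/integration-vm.py; the function names and the BOMA_SG_REEXEC environment variable are hypothetical, and the sketch assumes a Linux host where sg is on PATH.

```python
import grp
import os
import shlex
import sys

REEXEC_FLAG = "BOMA_SG_REEXEC"  # hypothetical env var; prevents a re-exec loop


def missing_group(name, current_gids=None, lookup=grp.getgrnam):
    """True when the group exists but this process does not carry its gid."""
    if current_gids is None:
        current_gids = os.getgroups() + [os.getegid()]
    try:
        gid = lookup(name).gr_gid
    except KeyError:
        return False  # no such group on this machine, so nothing to heal
    return gid not in set(current_gids)


def sg_command(name, argv):
    """Build the `sg <group> -c '<command>'` argv that re-runs argv."""
    return ["sg", name, "-c", shlex.join(argv)]


def ensure_group(name="libvirt"):
    """Re-exec the current script under sg if the session's groups are stale."""
    if os.environ.get(REEXEC_FLAG) == "1" or not missing_group(name):
        return
    os.environ[REEXEC_FLAG] = "1"
    command = sg_command(name, [sys.executable, *sys.argv])
    os.execvp(command[0], command)  # replaces this process; does not return
```

Calling ensure_group() at the top of the driver's entry point would make the stale-session case transparent. The flag variable matters because sg adds only the one group, so a second check inside the re-exec'd process should not try again if something else is wrong.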


Kaizen reviews — decisions ledger

Consumed signals and where their resolution now lives. Newest first.

2026-06-17

Second /kaizen run. 7 signals triaged; all 7 consumed (0 kept open). Two heavier items (the rename-incomplete scan check and the Forgejo registry-login path) were built by parallel subagents and verified against the diff. Bias-to-remove note: one PARK (the ubongo self-management gap — out-of-phase, already tracked in STATUS) and zero REMOVE; the rest accreted (migrate/change). None of the open signals were [unused] tooling, so there was nothing to delete — the only reductive move available was parking the out-of-phase build. Cadence: healthy — 3 days after the first run, every signal 0–2 days old except the one carried over from 2026-06-14; the "recurring ≥3" nudge in scripts/friction-scan.py didn't fire this pass (all recurrence counts were 1), so the thresholds need no change.

| Signal (first seen) | Verdict | Resolution / where it lives now |
| --- | --- | --- |
| ADRs claim cross-doc reconciliation they didn't perform (06-14) | SYSTEMATIZE | New rename-incomplete check in scripts/repo-scan.py (+7 tests): when a numbered ADR announces a rename Old→New, flag any design-doc line where Old still appears in present tense (skips the announcing ADR, lines also naming New, and historical/negation cues; rejects ADR-NNN tokens as terms). 0 findings on the current tree — the Traefik→Caddy ripple edits have landed. Structural cousin of stale-deferred; run by /review-repo. (Was KEEP-OPEN on 2026-06-14 — now built.) |
| Image push to the Forgejo registry needs an interactive docker login (06-15) | SYSTEMATIZE → vault | Vault-backed login path so pushes are agent-completable: vault.forgejo.registry_token stub (CHANGEME, operator-minted) + scripts/registry-login.sh (reads the token, docker login --password-stdin, never echoes it) + make registry-login + a prereq note in docs/runbooks/claude-code-setup.md. Works once the operator fills the token via make edit-vault. |
| Single-file bind mount + atomic rewrite = stale config (06-16) | SYSTEMATIZE | docs/testing/gotchas.md — "Single-file bind mount + atomic rewrite = stale config (reload-in-place only)": template writes a new inode, a single-file bind mount pins the old one, so an in-container reload reads stale config. Mount the config directory for reload-in-place roles; restart-based roles are fine with a single-file mount. |
| make check always fails on the first-ever deploy of a compose service role (06-16) | CHANGE | check_mode: false on the state: directory scaffold tasks in roles/reverse_proxy + roles/netbird_coordinator, so the base dirs exist under --check and the rest of the dry-run (templates + compose) evaluates instead of failing on a missing project_src. Inert under converge → Molecule unchanged. |
| Re-asked settled defaults — push + execution mode, in prose (06-17) | CHANGE (exec) + ACCEPTED (push) | Widened .claude/hooks/guard-execution-mode-menu.sh to also catch free-form prose re-asks of the subagent-vs-inline choice ("which execution approach?", "subagent vs inline", …), not just the literal menu; tested. The push re-ask stays a soft default via the dont-reask-settled-defaults memory — a genuine "should I push?" is sometimes legitimate, so it is deliberately not hard-blocked. |
| Docs-only commit tripped the rbw-locked pre-commit guard (06-17) | CHANGE | Root cause was NOT the ansible-lint files: scope (innocent) — it was .claude/hooks/guard-vault-preflight.sh blocking every locked git commit. Rewrote it to inspect the staged set (git diff --cached, plus -a/--all) and block only when Ansible content (^(roles\|playbooks\|inventories)/.*\.ya?ml$) is staged; docs-/config-only commits are now exempt. Fail-safe to block when unsure. Tested. |
| Agent can't self-manage ubongo (the control node it runs on) without operator grants (06-17) | PARK | The knowledge already lives in STATUS.md (control-node row: the interim claude-key + sjat NOPASSWD grants, and Pending: the proper ansible-user bootstrap) and the ubongo-self-sufficiency memory. Out-of-phase — the fix is the control-node bootstrap recipe, a tracked future build. Resurrection trigger: when building ubongo's base hardening / ansible-user bootstrap, fold in key-trusted NOPASSWD self-management so control-node self-management needs no ad-hoc operator grants. |
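
The single-file bind mount row above turns on one mechanism: an atomic rewrite (write a temp file, then rename it over the target) gives the path a new inode, and anything still holding the old inode keeps seeing the old bytes. The sketch below shows that with plain files; an already-open file handle stands in for the bind mount, which is an analogy chosen here rather than a container test. The file name app.conf and both function names are hypothetical, and the sketch assumes a POSIX filesystem.

```python
import os
import tempfile


def atomic_rewrite(path, text):
    """Write text to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp, path)  # the path now points at a different inode


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        config = os.path.join(workdir, "app.conf")  # hypothetical file name
        with open(config, "w") as handle:
            handle.write("old config\n")
        inode_before = os.stat(config).st_ino
        pinned = open(config)  # stand-in for the single-file bind mount
        try:
            atomic_rewrite(config, "new config\n")
            inode_after = os.stat(config).st_ino
            seen_through_pin = pinned.read()  # still reads the old inode
        finally:
            pinned.close()
        with open(config) as handle:
            seen_by_path = handle.read()  # a fresh lookup reaches the new inode
    return inode_before != inode_after, seen_through_pin, seen_by_path
```

Mounting the directory instead of the file corresponds to the fresh path lookup in the last step, which is why it sees the new content on reload.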

2026-06-14

First /kaizen run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above — a repo-scan.py check is its own build). Bias-to-remove note: zero PARK/REMOVE — none of the open signals were [unused] tooling; they were all knowledge/gotchas/process, which migrate or archive (knowledge is never deleted).

| Signal (first seen) | Verdict | Resolution / where it lives now |
| --- | --- | --- |
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (.claude/hooks/guard-execution-mode-menu.sh, wired in .claude/settings.json) is verified firing on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in superpowers:subagent-driven-development, used for the /kaizen build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + docs/decisions/adr-template.md settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → RESOLVED 2026-06-15 | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral closed — root cause was version skew (pre-Bearer libdns/gandi sent Gandi's deprecated Apikey header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + roles/reverse_proxy. |
| apply:{tags} not propagated by dynamic include_tasks (06-14) | SYSTEMATIZE | docs/testing/gotchas.md — "Tags on dynamic include_tasks need apply:". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | docs/testing/gotchas.md — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | docs/testing/gotchas.md — "API / templating roles: render-only tests miss the real call". |
| item.values in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with item['key'], never item.key"). |
| TF child modules need their own required_providers (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own required_providers in versions.tf"). |
| ansible-lint var-naming rejects access__/backup__ cross-role names (06-14) | SYSTEMATIZE | make new-role scaffolds a noqa reminder in defaults/main.yml; ADR-004's service-role section documents the convention; roles/reverse_proxy/defaults/main.yml is the reference. |
| Gandi rejects RFC-7505 null-MX 0 . (06-14) | MIGRATE | roles/public_dns/README.md Notes (no MX + SPF -all + DMARC reject for a no-mail domain). |
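
The item.values row above comes down to lookup order. Jinja2's documented behaviour is that dot access tries an attribute first and falls back to an item, while subscript access tries the item first. The sketch below emulates that order in plain Python so it runs without Jinja2 installed; dot_lookup, subscript_lookup, and the sample loop item are invented here for illustration, and real Jinja2 returns an undefined object instead of raising when both lookups fail.

```python
def dot_lookup(obj, name):
    """Emulate `obj.name`: attribute first, item as the fallback."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


def subscript_lookup(obj, key):
    """Emulate `obj['key']`: item first, attribute as the fallback."""
    try:
        return obj[key]
    except (KeyError, IndexError, TypeError):
        return getattr(obj, key)


# Hypothetical loop item whose key collides with a dict method name.
item = {"name": "web", "values": ["a", "b"]}

method = dot_lookup(item, "values")        # the bound dict.values method
data = subscript_lookup(item, "values")    # the list stored under the key
plain = dot_lookup(item, "name")           # no attribute clash, so the key wins
```

Only keys that shadow a dict method (values, items, keys, and so on) misbehave under dot access, which is why the bug is easy to miss until a variable happens to use one of those names.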

2026-06-10

| Signal (first seen) | Verdict | Resolution / where it lives now |
| --- | --- | --- |
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in .claude/settings.json blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a mechanical guard, not another note. |
| Every git commit needs rbw unlock — recurring (05-30) | CHANGE | Root cause was not the vault syntax-check (.ansible-lint already excludes vault.yml); it was ansible-lint auto-loading + decrypting inventories/production/group_vars/all/vault.yml via the wired vault_password_file. Scoped the pre-commit ansible-lint hook (always_run: false + files: ansible content) so docs-/config-only commits skip it and need no vault. Ansible-content commits still need rbw (intrinsic to linting vault-backed plays; accepted). |
| make test fails when run non-activated — ansible-config not found (06-06) | CHANGE | Makefile test/test-all now prepend $(CURDIR)/.venv/bin to PATH. |
| Molecule image missing from the Forgejo registry (06-06) | already built | make molecule-image-push target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | scripts/repo-scan.py open-deferred-item / stale-deferred checks, run by /review-repo. |
| make new-role brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft iif vs iifname, Molecule ansible_host, apply-path coverage blind spot, render-nft -c pattern (06-06) | MIGRATE | docs/testing/gotchas.md (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, rbw sync stale cache, zsh word-split (05-30) | MIGRATE | docs/runbooks/claude-code-setup.md "Environment gotchas". |
| finishing-a-development-branch offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |

Process note: the 2026-06-10 review was manual (the /retro//kaizen tool wasn't built). The 2026-06-14 block was the first run of /kaizen itself (scripts/friction-scan.py Phase 0 + .claude/commands/kaizen.md); the dogfood both cleared the backlog and validated the command.