sjat/boma

sjat d6e80990b2 fix(integration): real wait_for_ip arp-fallback test + document substrate coverage gap

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-19 22:41:11 +02:00

31 KiB

FRICTION.md — kaizen friction log

Raw signals for the periodic kaizen review (/kaizen; see docs/TODO.md 11). This is the input that keeps our tooling and conventions sharpening over time instead of only accreting.

How to use: append freely during work under Open signals — don't curate, don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling that isn't earning its keep. /kaizen reads this, then proposes a verdict per signal (SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased toward remove/park for unused tooling), migrates durable knowledge into the right docs, and moves consumed signals into the decisions ledger below.

Entry format: date — [tag] observation — (optional) → systematization idea

Tags: [friction] recurring annoyance · [gotcha] surprising behaviour · [recurring] keeps coming back, should be systematized · [unused] tooling not earning its keep.

Open signals

(append new raw signals here; the next kaizen review consumes them)

[friction] Re-asked settled defaults (push + subagent-driven) at the plan→execute handoff (2026-06-19): despite the standing preference (memory dont-reask-settled-defaults: push to origin as off-machine backup and go subagent-driven, both WITHOUT asking), I again asked the operator "which execution approach?" and "want me to push?". The writing-plans skill scripts that handoff question ("Which approach?"), and confirming a push felt natural — both overrode the memory. → at the writing-plans → execution handoff, default to subagent-driven execution and push to origin without a confirmation gate; reserve questions for genuine forks. Recurrence of an already-recorded signal — treat the skill's scripted "Which approach?" as pre-answered (subagent-driven) for this operator.

[gotcha] base's nftables forward policy drop breaks Docker hosts on reboot (2026-06-17): base/templates/nftables.conf.j2 sets chain forward { ... policy drop; }. On a Docker host, container traffic is forwarded (published-port DNAT → container, and inter-container over the bridge), so the drop kills it. It worked right after make deploy (Docker's runtime rules coexisted) but after a reboot nftables loaded our default-deny before Docker, breaking WAN→Caddy and Caddy→coordinator → the public services and the mesh went down. The docker_host "nftables.d container-forward rules" that would make this Docker-safe are explicitly pending (STATUS.md). → the base firewall (base__firewall_apply) must NOT be applied to any Docker host until docker_host ships the container-forward rules; add a guard/check (a Docker host with firewall_apply: true and no container-forward drop-in is a misconfiguration), and the firewall design (ADR-020) should state the Docker-host dependency explicitly.
[gotcha] ip_nonlocal_bind did NOT beat the sshd boot-race (2026-06-17): the mesh-hardening plan bound sshd ListenAddress to the wt0 IP and set net.ipv4.ip_nonlocal_bind=1 so sshd could bind the mesh IP before wt0 exists at boot. In practice the console still showed sshd "could not assign the address" at boot — so the protection did not work as designed, and because wt0 never came up (the coordinator was down), sshd had no listener at all → no SSH path. → the entire "sshd listens on wt0 only" premise is unsound without (a) a verified boot-race fix and (b) a guaranteed non-mesh break-glass. Re-investigate why ip_nonlocal_bind didn't help (ordering vs the sysctl drop-in load? the sysctl not applied before sshd start?), or drop ListenAddress-on-mesh entirely and rely on the host firewall for SSH scoping.
[gotcha] The coordinator host can't bootstrap the mesh it depends on (2026-06-17): askari runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird agent needs the coordinator (a local container) to be serving to bring up wt0 — but the coordinator wasn't healthy, so wt0 never came up. Circular. Combined with sshd being wt0-only, the host was reachable only via the Hetzner console. → the coordinator host must keep a non-mesh management path always (don't move its SSH onto wt0), or the mesh-hardening must treat the coordinator host as a special case. General rule: never make a host's only management path depend on a service that host itself hosts.
[gotcha] NetBird netbird-server FATAL-loops on the geolocation DB download with no egress (2026-06-17): on startup the combined netbird-server:0.72.4 tries to download the GeoLite2 DB from pkgs.netbird.io and treats failure as FATAL (crash-loop) — so any loss of container egress (here: Docker NAT masquerade wiped when nftables was flushed, not re-added by a plain restart docker) takes the whole control plane down. Recovery was restart docker (rebuild NAT) → force-recreate the container so it could download. → for the netbird_coordinator role: pre-seed/persist the geo DB in the data dir (or pin a local copy), or disable the geolocation requirement, so a transient egress blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT) is fragile across nft flush + reboot.
[friction] No off-site coordinator backup turned a 2-minute restore into a long live recovery (2026-06-17): the NetBird coordinator's stateful store (/var/lib/netbird, encrypted SQLite) has no off-site backup yet (ADR-022 backup role pending, flagged in STATUS as the coordinator's deferred backup). During the incident there was a real fear the unclean reboots had corrupted the store, with no restore path. It turned out to be a runtime/egress issue, not corruption — but the absence of a backup made the whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the netbird_coordinator store ahead of the rest of the backup role; a recent off-host copy would have made "rebuild askari from scratch" a safe option.
[friction] The plan tested reboot-recovery AFTER removing the break-glass (2026-06-17): the mesh-hardening plan's live cutover closed the WAN :22 (step 5) before the reboot-resilience test (step 7), so the one fallback path was gone exactly when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for lockout-risky cutovers: validate reboot-recovery while the old access path is still open, and only retire the break-glass once recovery (incl. a reboot) is proven. Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.

[gotcha] Debian 13 genericcloud boot-loops under legacy BIOS/SeaBIOS (2026-06-18): virt-install --import of the genericcloud qcow2 with the default (SeaBIOS) firmware triple-faults at the real-mode kernel handoff — GRUB loops, no "Decompressing Linux", no DHCP lease. The symptom (no network) pointed away from the cause (firmware). → boot test VMs via UEFI (virt-install --boot uefi; OVMF→efistub).
[friction] The no-sudo claude model blocked diagnosing a failed VM (2026-06-18): under ADR-015 claude had no sudo, so when the VM wouldn't network there was no way to introspect it (serial logs are root:0600, libguestfs not installed, mounting needs root). Diagnosis was fully blocked until the operator granted claude sudo. → DECISION: claude gets NOPASSWD:ALL (reverses ADR-015's "no local sudo"); compensating control is auditd/Loki attribution (already in ADR-015). Amend ADR-015/ADR-021 + accepted-risks; codify the sudoers drop-in in Ansible.
[gotcha] Non-root virsh/virt-install default to qemu:///session (2026-06-18): the substrate (NAT net, /dev/kvm) lives on qemu:///system. → pin LIBVIRT_DEFAULT_URI=qemu:///system in the driver.
[gotcha] qemu:///system (libvirt-qemu) can't traverse /home (2026-06-18): VM disk/seed/console under the repo/home failed "Permission denied (search permissions for /home/claude)". → put per-VM artifacts in a system-readable dir (/var/lib/boma-integration, group libvirt); the inventory (read by ansible as the user) can stay in the repo.
[gotcha] ansible-playbook -i <dir>/ parses sibling non-inventory files as INI (2026-06-18): pointing -i at a run-dir holding a state file + qcow2s made the directory inventory loader parse the state file as INI → phantom hosts INCLUDING the real askari (with its real vars), breaking the single-host isolation invariant. → point -i at the single hosts.yml. Caught by the holistic cross-file review BEFORE any hardware run.
[gotcha] Jinja {%- -%} + ansible trim_blocks=True double-strip newlines (2026-06-18): a template edit used {%- -%}, reviewed by rendering with RAW jinja2 (trim_blocks=False) which looked fine; ansible (trim_blocks=True) then collapsed the rendered Caddyfile onto single lines → caddy crash-looped on invalid config. → verify templates with ansible's whitespace (trim_blocks=True), not raw jinja2; prefer plain {% %} at column 0 (the repo's existing style).
[gotcha] Fresh cloud images have empty apt lists (2026-06-18): apt install nftables failed "No package matching 'nftables' is available" on a fresh genericcloud VM whose cloud-init had package_update: false. → package_update: true AND block on cloud-init status --wait before applying.
[gotcha] base's default-deny firewall drops SSH to a NAT'd VM unless the gateway is allowed (2026-06-18): the driver reaches the VM via the libvirt-NAT gateway (192.168.150.1). ct state established,related accept saves the in-flight apply connection, but a fresh post-reboot SSH is dropped without an explicit allow. → test overlay sets base__firewall_control_addr to the NAT gateway.
[recurring] Real-hardware shakedown and static review each caught what the other couldn't (2026-06-18): the qemu-URI, storage-path, UEFI, apt-list, and caddy-render bugs ALL surfaced only on a live KVM run; the phantom-host inventory bug surfaced only in the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial cross-file review AND a real-hardware run; neither alone would have shipped it working.
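
Sketch for two of the 2026-06-18 fixes above (pin LIBVIRT_DEFAULT_URI, boot via UEFI), written as the driver might build its subprocess inputs. The helper names `libvirt_env` and `virt_install_argv` are hypothetical and not taken from scripts/integration-vm.py; only the environment variable and the `--import` / `--boot uefi` flags come from the entries. The argv is deliberately partial: the caller supplies name, memory, disk and network options.

```python
import os

LIBVIRT_SYSTEM_URI = "qemu:///system"  # where the NAT net and /dev/kvm substrate live


def libvirt_env(base_env=None):
    """Copy the environment and pin the libvirt URI, overriding any session default."""
    env = dict(os.environ if base_env is None else base_env)
    env["LIBVIRT_DEFAULT_URI"] = LIBVIRT_SYSTEM_URI
    return env


def virt_install_argv(extra_args):
    """Prefix caller-supplied options with the import + UEFI flags from the entries above."""
    return ["virt-install", "--import", "--boot", "uefi", *extra_args]
```

Passing `env=libvirt_env()` to every `subprocess.run` call for virsh/virt-install keeps a non-root caller off qemu:///session without relying on the operator's shell profile.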

[friction] Raw DHCP leases pinned in ubongo's host firewall (admin-addr SSH allows) (2026-06-19): mesh-hardening 2/3 lets the operator workstations reach ubongo's LAN SSH by raw lease — base__firewall_admin_addrs: ["10.20.10.50" (mamba), "10.20.10.17"] — because there is no DHCP reservation yet (OPNsense isn't managed as code). A lease reassignment silently moves the allow to whatever host next holds the IP (still SSH-key-gated) and drops the workstation's LAN path (mesh still works, so never a full lockout). → when OPNsense-as-code lands (ADR-020 perimeter / TODO 3.5), replace both with MAC-pinned DHCP reservations (10.20.10.17 = MAC bc:0f:f3:c8:4a:8a; mamba's MAC TBD) and allow the reserved IPs. Spec: docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md.
[gotcha] make test-integration on ubongo fails (qemu-img "Permission denied") when the agent session predates the libvirt group grant (2026-06-19): the integration_test role adds claude to libvirt+kvm and makes the cache dir /var/lib/boma-integration root:libvirt 2775 — correct — but a claude session whose shell started before that grant carries a stale process group set (id → claude,docker only, no libvirt), so qemu-img create of the VM overlay into the group-owned dir is denied. virsh/virt-install still work (they reach system libvirtd via polkit/socket, and the real KVM runs server-side as libvirt-qemu), so ONLY claude's own file-writes break. Unblock without restarting the session: sg libvirt -c 'make test-integration HOST=<name>' (claude needs only libvirt for the dir; kvm is server-side; note sg adds one group, not the full set). → self-heal in scripts/integration-vm.py: if the libvirt gid is absent from os.getgroups(), re-exec under sg libvirt (or have the Makefile target do it), so a stale-session agent never hits this opaque symptom. New agent sessions pick the groups up on login, so it's a stale-session transient — but high-confusion, worth self-healing.
[friction] No standard for when the agent may run local-VM integration tests on ubongo without asking (2026-06-19): make test-integration HOST=<name> spins an ISOLATED throwaway KVM VM (its own libvirt NAT; never touches the real host's firewall/network; guards: one-VM-at-a-time + a 4 GiB free-RAM floor + auto-destroy on success), so it is safe and self-contained — yet the agent paused for a go-ahead before running it (mesh-hardening 2/3, Task 4). The operator wants a STANDARD that pre-authorises VM-testing on ubongo so the agent just runs it. → decide + record the rule: e.g. a .claude/settings.json permission allow for make test-integration* / scripts/integration-vm.py (and the sg libvirt -c '…' form per the gotcha above), plus a CLAUDE.md line distinguishing the pre-authorised isolated VM tests from the genuinely-gated live steps (make deploy to real hosts, host reboots, cutovers — still need a go-ahead). Ties to the test-risky-infra-before-live-deploy + dont-reask-settled-defaults memories + ADR-025.
[gotcha] Molecule covers only the input_only-OFF (forward drop) branch of the base firewall (2026-06-19): mesh-hardening 2/3 added base__firewall_input_only (forward policy drop↔accept). The default Molecule scenario renders ONE fixture, set to the secure default (drop) — so the fast make test ROLE=base gate locks the drop default (security-critical for service hosts) but does NOT exercise the =true → forward-accept rendering; only make test-integration HOST=ubongo does (passed GREEN). An in-converge re-render can't cheaply cover it (role defaults aren't in scope outside the role run). → decide in kaizen: a second Molecule scenario (molecule/input-only/) asserting forward policy accept, vs accepting the integration-only coverage. Final-review finding; not a cutover blocker (the accept branch is a literal, and a var-name break would fail the drop branch too → caught).
[gotcha] Applying base's firewall to a Docker host flushes Docker's nat → container egress dies until restart docker (2026-06-19, mesh-hardening 2/3 live cutover): base's nftables.conf.j2 starts with flush ruleset, which wipes ALL tables incl. Docker's ip nat/ip filter (+ libvirt's). On ubongo I chose INPUT-only so forward stays accept — yet the apply STILL broke CONTAINER egress: docker pull worked (dockerd uses HOST egress) but a container ping FAILED — the masquerade (SNAT) was gone, so replies couldn't return. forward accept permits forwarding but can't replace the missing nat. The spec's "input-only keeps Docker egress working" was therefore incomplete, and the local-VM harness couldn't catch it (the test VM runs no Docker). Fix on the live host: systemctl restart docker re-adds its ip nat/ip filter (egress restored; coexists fine with base's inet filter). On REBOOT it self-heals (dockerd re-adds nat on boot; forward accept doesn't block — unlike the 2026-06-17 forward drop incident). → (1) any cutover/runbook applying base firewall to a Docker host MUST restart docker + check container egress after the apply; (2) the pending docker_host nftables integration should own re-adding/persisting Docker's rules so base's flush is safe; (3) the firewall final-review checklist should include "does the host run Docker/libvirt? the flush wipes their nat."
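
Sketch of the stale-session self-heal proposed in the make test-integration [gotcha] above: if the libvirt gid is missing from the process group set, re-exec once under `sg libvirt`. This is a minimal illustration under assumptions, not the code in scripts/integration-vm.py; the function names and the `BOMA_SG_REEXEC` loop guard are hypothetical.

```python
import grp
import os
import shlex
import sys


def missing_group(name, current_gids=None):
    """True when group `name` exists on the host but is absent from this process."""
    if current_gids is None:
        current_gids = set(os.getgroups()) | {os.getegid()}
    try:
        gid = grp.getgrnam(name).gr_gid
    except KeyError:
        return False  # group not defined here; nothing to heal
    return gid not in current_gids


def sg_command(group, argv):
    """Build the `sg <group> -c '<command>'` argv that re-runs argv with the group added."""
    return ["sg", group, "-c", shlex.join(argv)]


def reexec_under_group(group="libvirt", guard_env="BOMA_SG_REEXEC"):
    """Re-exec the running script under sg once; the env guard prevents an exec loop."""
    if os.environ.get(guard_env) or not missing_group(group):
        return
    os.environ[guard_env] = "1"
    cmd = sg_command(group, [sys.executable, *sys.argv])
    os.execvp(cmd[0], cmd)  # replaces the current process; does not return on success
```

Calling `reexec_under_group()` as the first statement of the script's entry point would make the stale-group case invisible to the agent; fresh sessions fall straight through because the gid is already present.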

[gotcha] inet filter default-deny blocks libvirt dnsmasq DHCP — silent, hard to diagnose (2026-06-19, task-3 integration gate): when base__firewall_input_only: true is applied to ubongo, the table inet filter { chain input { policy drop; } } blocks DHCP packets that arrive via the libvirt bridge (virbr-boma). In nftables, multiple tables at the same hook priority all run independently; an accept verdict in table ip filter LIBVIRT_INP does NOT prevent table inet filter from seeing and dropping the same packet. VMs never got DHCP leases (dnsmasq socket confirmed by strace to never receive POLLIN despite tcpdump seeing the packet on virbr-boma). Diagnosed by temporarily changing inet filter input to policy accept → fd=3 immediately fired. Fix: /etc/nftables.d/10-libvirt-boma.nft drop-in adding iifname "virbr-boma" accept (survives service restarts via include "/etc/nftables.d/*.nft"). → The base role's template needs a base__firewall_trusted_bridges variable so this is encoded at the Ansible level, not in a manual host drop-in. Every host that runs Docker or libvirt and also has base__firewall_input_only: true needs an analogous exception.
[gotcha] libvirt leaseshelper PID-file permission: virPidFileReleasePath unlinks /run/leaseshelper.pid after EVERY call; nobody cannot recreate it (2026-06-19, task-3 integration gate): dnsmasq runs as nobody; libvirt_leaseshelper is its --dhcp-script. The helper acquires a PID-file mutex at /run/leaseshelper.pid, but virPidFileReleasePath UNLINKS the file on exit. /run/ is root:root 755, so nobody cannot create the file after the first unlink → every subsequent add call fails with errno=13, dnsmasq silently drops the DHCP grant (no log, no error to the client). Fix: suid root C wrapper at /usr/lib/libvirt/libvirt_leaseshelper (original moved to .real) that pre-creates /run/leaseshelper.pid owned by nobody, then drops privileges and execs the real helper. The root dnsmasq fork calls the wrapper; suid gives it permission to touch /run/; on return to nobody uid the PID file stays. Also: /var/lib/libvirt/dnsmasq/ must be nobody:nogroup 775 so leaseshelper can update virbr-boma.status. This fix is host-local on ubongo and NOT in Ansible — encode it in an integration_test role task (or a libvirt role) before the harness can be safely re-deployed.
[gotcha] cloud-init rejects underscores in local-hostname → silently skips network-config → VM never gets DHCP (2026-06-19, task-3 integration gate): setting local-hostname: boma-it-askari_inputonly-<uuid> caused cloud-init-local to consider the hostname invalid and skip writing the network-config to the system. Systemd-networkd then used the genericcloud default (no DHCP), so VMs got only IPv6 link-local. Fix in scripts/integration-vm.py: name.replace("_", "-") in the meta-data hostname (disk paths and virsh domain names keep the original underscore). Sanitization rule: RFC-952 hostnames allow hyphens, not underscores.
[friction] Molecule Docker image can't apt install → roles with real package tasks have no Molecule substrate coverage (2026-06-19): the Docker Molecule image ships with cleared apt-lists and no internet access, so any role whose core work is apt install — base, docker_host, integration_test — cannot cover its package/substrate tasks in Molecule. Those tasks are validated only by make test-integration (ADR-025, real KVM). The gap is systemic: it affects every role with non-trivial package or system-level setup. → systematization idea: provide a Molecule image or driver that can install packages (e.g. a custom Docker image with pre-seeded apt-lists, or a prepare.yml that pre-installs packages from a local cache), or an alternative driver (e.g. molecule-libvirt using the same KVM harness), so substrate tasks get real Molecule unit coverage rather than relying entirely on the integration harness.
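
Sketch of the hostname sanitization from the cloud-init underscore [gotcha] above. The `replace("_", "-")` step is the fix recorded there; the extra label check (letters, digits, hyphens, at most 63 characters, no leading or trailing hyphen) is an added assumption so a bad name fails loudly in the driver instead of silently inside cloud-init. The function name `cloud_init_hostname` is hypothetical.

```python
import re

# Assumed validity rule for a single hostname label; not taken from the repo.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def cloud_init_hostname(vm_name):
    """Return a meta-data local-hostname for vm_name; underscores become hyphens."""
    host = vm_name.replace("_", "-")
    if not _LABEL.fullmatch(host):
        raise ValueError(f"not a valid hostname label: {host!r}")
    return host
```

Only the meta-data hostname goes through this; disk paths and virsh domain names keep the original underscore, as noted in the entry.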

Kaizen reviews — decisions ledger

Consumed signals and where their resolution now lives. Newest first.

2026-06-17

Second /kaizen run. 7 signals triaged; all 7 consumed (0 kept open). Two heavier items (the rename-incomplete scan check and the Forgejo registry-login path) were built by parallel subagents and verified against the diff. Bias-to-remove note: one PARK (the ubongo self-management gap — out-of-phase, already tracked in STATUS) and zero REMOVE; the rest accreted (migrate/change). None of the open signals were [unused] tooling, so there was nothing to delete — the only reductive move available was parking the out-of-phase build. Cadence: healthy — 3 days after the first run, every signal 0–2 days old except the one carried over from 2026-06-14; the "recurring ≥3" nudge in scripts/friction-scan.py didn't fire this pass (all recurrence counts were 1), so the thresholds need no change.

| Signal (first seen) | Verdict | Resolution / where it lives now |
| --- | --- | --- |
| ADRs claim cross-doc reconciliation they didn't perform (06-14) | SYSTEMATIZE | New `rename-incomplete` check in `scripts/repo-scan.py` (+7 tests): when a numbered ADR announces a rename `Old`→`New`, flag any design-doc line where `Old` still appears in present tense (skips the announcing ADR, lines also naming `New`, and historical/negation cues; rejects `ADR-NNN` tokens as terms). 0 findings on the current tree — the Traefik→Caddy ripple edits have landed. Structural cousin of `stale-deferred`; run by `/review-repo`. (Was KEEP-OPEN on 2026-06-14 — now built.) |
| Image push to the Forgejo registry needs an interactive `docker login` (06-15) | SYSTEMATIZE → vault | Vault-backed login path so pushes are agent-completable: `vault.forgejo.registry_token` stub (CHANGEME, operator-minted) + `scripts/registry-login.sh` (reads the token, `docker login --password-stdin`, never echoes it) + `make registry-login` + a prereq note in `docs/runbooks/claude-code-setup.md`. Works once the operator fills the token via `make edit-vault`. |
| Single-file bind mount + atomic rewrite = stale config (06-16) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Single-file bind mount + atomic rewrite = stale config (reload-in-place only)": `template` writes a new inode, a single-file bind mount pins the old one, so an in-container reload reads stale config. Mount the config directory for reload-in-place roles; restart-based roles are fine with a single-file mount. |
| `make check` always fails on the first-ever deploy of a compose service role (06-16) | CHANGE | `check_mode: false` on the `state: directory` scaffold tasks in `roles/reverse_proxy` + `roles/netbird_coordinator`, so the base dirs exist under `--check` and the rest of the dry-run (templates + compose) evaluates instead of failing on a missing `project_src`. Inert under converge → Molecule unchanged. |
| Re-asked settled defaults — push + execution mode, in prose (06-17) | CHANGE (exec) + ACCEPTED (push) | Widened `.claude/hooks/guard-execution-mode-menu.sh` to also catch free-form prose re-asks of the subagent-vs-inline choice (`"which execution approach?"`, `"subagent vs inline"`, …), not just the literal menu; tested. The push re-ask stays a soft default via the `dont-reask-settled-defaults` memory — a genuine "should I push?" is sometimes legitimate, so it is deliberately not hard-blocked. |
| Docs-only commit tripped the rbw-locked pre-commit guard (06-17) | CHANGE | Root cause was NOT the ansible-lint `files:` scope (innocent) — it was `.claude/hooks/guard-vault-preflight.sh` blocking every locked `git commit`. Rewrote it to inspect the staged set (`git diff --cached`, plus `-a`/`--all`) and block only when Ansible content (`^(roles\|playbooks\|inventories)/.*\.ya?ml$`) is staged; docs-/config-only commits are now exempt. Fail-safe to block when unsure. Tested. |
| Agent can't self-manage `ubongo` (the control node it runs on) without operator grants (06-17) | PARK | The knowledge already lives in `STATUS.md` (control-node row: the interim `claude`-key + `sjat` NOPASSWD grants, and Pending: the proper `ansible`-user bootstrap) and the `ubongo-self-sufficiency` memory. Out-of-phase — the fix is the control-node bootstrap recipe, a tracked future build. Resurrection trigger: when building ubongo's `base` hardening / `ansible`-user bootstrap, fold in key-trusted NOPASSWD self-management so control-node self-management needs no ad-hoc operator grants. |
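
Sketch of the staged-set decision described in the guard-vault-preflight row above. The real hook is a shell script; this Python version only illustrates the matching rule and the fail-safe, and the function name `needs_vault` is hypothetical. The pattern is the one quoted in the row, with the table escaping on `|` removed.

```python
import re

# Ansible content per the ledger row: YAML under roles/, playbooks/ or inventories/.
ANSIBLE_CONTENT = re.compile(r"^(roles|playbooks|inventories)/.*\.ya?ml$")


def needs_vault(staged_paths):
    """Decide whether a commit must be blocked while the vault is locked.

    staged_paths: repo-relative paths (for example from `git diff --cached --name-only`),
    or None when the staged set could not be determined.
    """
    if staged_paths is None:
        return True  # fail-safe: block when unsure
    return any(ANSIBLE_CONTENT.match(path) for path in staged_paths)
```

A docs-only set such as `["FRICTION.md", "docs/TODO.md"]` passes, while anything under `roles/` ending in `.yml` or `.yaml` still requires the unlock.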

2026-06-14

First /kaizen run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above — a repo-scan.py check is its own build). Bias-to-remove note: zero PARK/REMOVE — none of the open signals were [unused] tooling; they were all knowledge/gotchas/process, which migrate or archive (knowledge is never deleted).

| Signal (first seen) | Verdict | Resolution / where it lives now |
| --- | --- | --- |
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is verified firing on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → RESOLVED 2026-06-15 | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral closed — root cause was version skew (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |

2026-06-10

| Signal (first seen) | Verdict | Resolution / where it lives now |
| --- | --- | --- |
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a mechanical guard, not another note. |
| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was not the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so docs-/config-only commits skip it and need no vault. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |

Process note: the 2026-06-10 review was manual (the /retro / /kaizen tool wasn't built). The 2026-06-14 block was the first run of /kaizen itself (scripts/friction-scan.py Phase 0 + .claude/commands/kaizen.md); the dogfood both cleared the backlog and validated the command.
