FRICTION.md — kaizen friction log
Raw signals for the periodic kaizen review (/kaizen; see docs/TODO.md 11). This is
the input that keeps our tooling and conventions sharpening over time instead of only
accreting.
How to use: append freely during work under Open signals — don't curate,
don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
that isn't earning its keep. /kaizen reads this, then proposes a verdict per signal
(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
toward remove/park for unused tooling), migrates durable knowledge into the right docs,
and moves consumed signals into the decisions ledger below.
Entry format: date — [tag] observation — (optional) → systematization idea
Tags: [friction] recurring annoyance · [gotcha] surprising behaviour ·
[recurring] keeps coming back, should be systematized · [unused] tooling not
earning its keep.
Open signals
(append new raw signals here; the next kaizen review consumes them)
- [friction] Re-asked settled defaults (push + subagent-driven) at the plan→execute handoff (2026-06-19): despite the standing preference (memory `dont-reask-settled-defaults`: push to origin as off-machine backup and go subagent-driven, both WITHOUT asking), I again asked the operator "which execution approach?" and "want me to push?". The `writing-plans` skill scripts that handoff question ("Which approach?"), and confirming a push felt natural — both overrode the memory. → at the writing-plans → execution handoff, default to subagent-driven execution and push to origin without a confirmation gate; reserve questions for genuine forks. Recurrence of an already-recorded signal — treat the skill's scripted "Which approach?" as pre-answered (subagent-driven) for this operator.
- [gotcha] base's nftables `forward policy drop` breaks Docker hosts on reboot (2026-06-17): `base/templates/nftables.conf.j2` sets `chain forward { ... policy drop; }`. On a Docker host, container traffic is forwarded (published-port DNAT → container, and inter-container over the bridge), so the drop kills it. It worked right after `make deploy` (Docker's runtime rules coexisted) but after a reboot nftables loaded our default-deny before Docker, breaking WAN→Caddy and Caddy→coordinator → the public services and the mesh went down. The `docker_host` "`nftables.d` container-forward rules" that would make this Docker-safe are explicitly pending (STATUS.md). → the `base` firewall (`base__firewall_apply`) must NOT be applied to any Docker host until `docker_host` ships the container-forward rules; add a guard/check (a Docker host with `firewall_apply: true` and no container-forward drop-in is a misconfiguration), and the firewall design (ADR-020) should state the Docker-host dependency explicitly.
- [gotcha] `ip_nonlocal_bind` did NOT beat the sshd boot-race (2026-06-17): the mesh-hardening plan bound sshd `ListenAddress` to the `wt0` IP and set `net.ipv4.ip_nonlocal_bind=1` so sshd could bind the mesh IP before `wt0` exists at boot. In practice the console still showed sshd "could not assign the address" at boot — so the protection did not work as designed, and because `wt0` never came up (the coordinator was down), sshd had no listener at all → no SSH path. → the entire "sshd listens on `wt0` only" premise is unsound without (a) a verified boot-race fix and (b) a guaranteed non-mesh break-glass. Re-investigate why `ip_nonlocal_bind` didn't help (ordering vs the sysctl drop-in load? the sysctl not applied before sshd start?), or drop ListenAddress-on-mesh entirely and rely on the host firewall for SSH scoping.
- [gotcha] The coordinator host can't bootstrap the mesh it depends on (2026-06-17): `askari` runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird agent needs the coordinator (a local container) to be serving to bring up `wt0` — but the coordinator wasn't healthy, so `wt0` never came up. Circular. Combined with sshd being `wt0`-only, the host was reachable only via the Hetzner console. → the coordinator host must keep a non-mesh management path always (don't move its SSH onto `wt0`), or the mesh-hardening must treat the coordinator host as a special case. General rule: never make a host's only management path depend on a service that host itself hosts.
- [gotcha] NetBird `netbird-server` FATAL-loops on the geolocation DB download with no egress (2026-06-17): on startup the combined `netbird-server:0.72.4` tries to download the GeoLite2 DB from `pkgs.netbird.io` and treats failure as FATAL (crash-loop) — so any loss of container egress (here: Docker NAT masquerade wiped when `nftables` was flushed, not re-added by a plain `restart docker`) takes the whole control plane down. Recovery was `restart docker` (rebuild NAT) → force-recreate the container so it could download. → for the `netbird_coordinator` role: pre-seed/persist the geo DB in the data dir (or pin a local copy), or disable the geolocation requirement, so a transient egress blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT) is fragile across `nft flush` + reboot.
- [friction] No off-site coordinator backup turned a 2-minute restore into a long live recovery (2026-06-17): the NetBird coordinator's stateful store (`/var/lib/netbird`, encrypted SQLite) has no off-site backup yet (ADR-022 `backup` role pending, flagged in STATUS as the coordinator's deferred backup). During the incident there was a real fear the unclean reboots had corrupted the store, with no restore path. It turned out to be a runtime/egress issue, not corruption — but the absence of a backup made the whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the `netbird_coordinator` store ahead of the rest of the backup role; a recent off-host copy would have made "rebuild askari from scratch" a safe option.
- [friction] The plan tested reboot-recovery AFTER removing the break-glass (2026-06-17): the mesh-hardening plan's live cutover closed the WAN `:22` (step 5) before the reboot-resilience test (step 7), so the one fallback path was gone exactly when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for lockout-risky cutovers: validate reboot-recovery while the old access path is still open, and only retire the break-glass once recovery (incl. a reboot) is proven. Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.
- [gotcha] Debian 13 genericcloud boot-loops under legacy BIOS/SeaBIOS (2026-06-18): `virt-install --import` of the genericcloud qcow2 with the default (SeaBIOS) firmware triple-faults at the real-mode kernel handoff — GRUB loops, no "Decompressing Linux", no DHCP lease. The symptom (no network) pointed away from the cause (firmware). → boot test VMs via UEFI (`virt-install --boot uefi`; OVMF→efistub).
- [friction] The no-sudo `claude` model blocked diagnosing a failed VM (2026-06-18): under ADR-015 `claude` had no sudo, so when the VM had no network there was no way to introspect it (serial logs are `root:0600`, libguestfs not installed, mounting needs root). Diagnosis was fully blocked until the operator granted `claude` sudo. → DECISION: `claude` gets `NOPASSWD:ALL` (reverses ADR-015's "no local sudo"); compensating control is auditd/Loki attribution (already in ADR-015). Amend ADR-015/ADR-021 + accepted-risks; codify the sudoers drop-in in Ansible.
- [gotcha] Non-root `virsh`/`virt-install` default to `qemu:///session` (2026-06-18): the substrate (NAT net, `/dev/kvm`) lives on `qemu:///system`. → pin `LIBVIRT_DEFAULT_URI=qemu:///system` in the driver.
- [gotcha] `qemu:///system` (libvirt-qemu) can't traverse `/home` (2026-06-18): VM disk/seed/console under the repo in `/home` failed with "Permission denied (search permissions for /home/claude)". → put per-VM artifacts in a system-readable dir (`/var/lib/boma-integration`, group libvirt); the inventory (read by ansible as the user) can stay in the repo.
- [gotcha] `ansible-playbook -i <dir>/` parses sibling non-inventory files as INI (2026-06-18): pointing `-i` at a run-dir holding a state file + qcow2s made the directory inventory loader parse the state file as INI → phantom hosts INCLUDING the real `askari` (with its real vars), breaking the single-host isolation invariant. → point `-i` at the single `hosts.yml`. Caught by the holistic cross-file review BEFORE any hardware run.
- [gotcha] Jinja `{%- -%}` + ansible `trim_blocks=True` double-strip newlines (2026-06-18): a template edit used `{%- -%}`, reviewed by rendering with RAW jinja2 (`trim_blocks=False`) which looked fine; ansible (`trim_blocks=True`) then collapsed the rendered Caddyfile onto single lines → caddy crash-looped on invalid config. → verify templates with ansible's whitespace (`trim_blocks=True`), not raw jinja2; prefer plain `{% %}` at column 0 (the repo's existing style).
- [gotcha] Fresh cloud images have empty apt lists (2026-06-18): `apt install nftables` failed "No package matching 'nftables' is available" on a fresh genericcloud VM whose cloud-init had `package_update: false`. → `package_update: true` AND block on `cloud-init status --wait` before applying.
- [gotcha] base's default-deny firewall drops SSH to a NAT'd VM unless the gateway is allowed (2026-06-18): the driver reaches the VM via the libvirt-NAT gateway (`192.168.150.1`). `ct state established,related accept` saves the in-flight apply connection, but a fresh post-reboot SSH is dropped without an explicit allow. → test overlay sets `base__firewall_control_addr` to the NAT gateway.
- [recurring] Real-hardware shakedown and static review each caught what the other couldn't (2026-06-18): the qemu-URI, storage-path, UEFI, apt-list, and caddy-render bugs ALL surfaced only on a live KVM run; the phantom-host inventory bug surfaced only in the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial cross-file review AND a real-hardware run; neither alone would have shipped it working.
- [friction] Raw DHCP leases pinned in ubongo's host firewall (admin-addr SSH allows) (2026-06-19): mesh-hardening 2/3 lets the operator workstations reach ubongo's LAN SSH by raw lease — `base__firewall_admin_addrs: ["10.20.10.50" (mamba), "10.20.10.17"]` — because there is no DHCP reservation yet (OPNsense isn't managed as code). A lease reassignment silently moves the allow to whatever host next holds the IP (still SSH-key-gated) and drops the workstation's LAN path (mesh still works, so never a full lockout). → when OPNsense-as-code lands (ADR-020 perimeter / TODO 3.5), replace both with MAC-pinned DHCP reservations (`10.20.10.17` = MAC `bc:0f:f3:c8:4a:8a`; mamba's MAC TBD) and allow the reserved IPs. Spec: `docs/superpowers/specs/2026-06-19-mesh-hardening-ubongo-default-deny-design.md`.
- [gotcha] `make test-integration` on ubongo fails (`qemu-img` "Permission denied") when the agent session predates the `libvirt` group grant (2026-06-19): the `integration_test` role adds `claude` to `libvirt` + `kvm` and makes the cache dir `/var/lib/boma-integration` `root:libvirt 2775` — correct — but a `claude` session whose shell started before that grant carries a stale process group set (`id` → `claude,docker` only, no `libvirt`), so `qemu-img create` of the VM overlay into the group-owned dir is denied. `virsh`/`virt-install` still work (they reach system libvirtd via polkit/socket, and the real KVM runs server-side as `libvirt-qemu`), so ONLY claude's own file-writes break. Unblock without restarting the session: `sg libvirt -c 'make test-integration HOST=<name>'` (claude needs only `libvirt` for the dir; `kvm` is server-side; note `sg` adds one group, not the full set). → self-heal in `scripts/integration-vm.py`: if the `libvirt` gid is absent from `os.getgroups()`, re-exec under `sg libvirt` (or have the Makefile target do it), so a stale-session agent never hits this opaque symptom. New agent sessions pick the groups up on login, so it's a stale-session transient — but high-confusion, worth self-healing.
- [friction] No standard for when the agent may run local-VM integration tests on ubongo without asking (2026-06-19): `make test-integration HOST=<name>` spins an ISOLATED throwaway KVM VM (its own libvirt NAT; never touches the real host's firewall/network; guards: one-VM-at-a-time + a 4 GiB free-RAM floor + auto-destroy on success), so it is safe and self-contained — yet the agent paused for a go-ahead before running it (mesh-hardening 2/3, Task 4). The operator wants a STANDARD that pre-authorises VM-testing on ubongo so the agent just runs it. → decide + record the rule: e.g. a `.claude/settings.json` permission allow for `make test-integration*` / `scripts/integration-vm.py` (and the `sg libvirt -c '…'` form per the gotcha above), plus a CLAUDE.md line distinguishing the pre-authorised isolated VM tests from the genuinely-gated live steps (`make deploy` to real hosts, host reboots, cutovers — still need a go-ahead). Ties to the `test-risky-infra-before-live-deploy` + `dont-reask-settled-defaults` memories + ADR-025.
- [gotcha] Molecule covers only the `input_only`-OFF (forward drop) branch of the base firewall (2026-06-19): mesh-hardening 2/3 added `base__firewall_input_only` (forward policy `drop` ↔ `accept`). The `default` Molecule scenario renders ONE fixture, set to the secure default (drop) — so the fast `make test ROLE=base` gate locks the drop default (security-critical for service hosts) but does NOT exercise the `=true` → forward-`accept` rendering; only `make test-integration HOST=ubongo` does (passed GREEN). An in-converge re-render can't cheaply cover it (role defaults aren't in scope outside the role run). → decide in kaizen: a second Molecule scenario (`molecule/input-only/`) asserting forward `policy accept`, vs accepting the integration-only coverage. Final-review finding; not a cutover blocker (the accept branch is a literal, and a var-name break would fail the drop branch too → caught).
- [gotcha] Applying base's firewall to a Docker host flushes Docker's nat → container egress dies until `restart docker` (2026-06-19, mesh-hardening 2/3 live cutover): base's `nftables.conf.j2` starts with `flush ruleset`, which wipes ALL tables incl. Docker's `ip nat`/`ip filter` (+ libvirt's). On ubongo I chose INPUT-only so `forward` stays `accept` — yet the apply STILL broke CONTAINER egress: `docker pull` worked (dockerd uses HOST egress) but a container `ping` FAILED — the masquerade (SNAT) was gone, so replies couldn't return. `forward accept` permits forwarding but can't replace the missing nat. The spec's "input-only keeps Docker egress working" was therefore incomplete, and the local-VM harness couldn't catch it (the test VM runs no Docker). Fix on the live host: `systemctl restart docker` re-adds its `ip nat`/`ip filter` (egress restored; coexists fine with base's `inet filter`). On REBOOT it self-heals (dockerd re-adds nat on boot; `forward accept` doesn't block — unlike the 2026-06-17 `forward drop` incident). → (1) any cutover/runbook applying base firewall to a Docker host MUST `restart docker` + check container egress after the apply; (2) the pending `docker_host` nftables integration should own re-adding/persisting Docker's rules so base's `flush` is safe; (3) the firewall final-review checklist should include "does the host run Docker/libvirt? the flush wipes their nat."
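A minimal sketch of the stale-session self-heal proposed in the `qemu-img` "Permission denied" entry above, assuming it would run at the top of `scripts/integration-vm.py`. The helper names (`has_group`, `sg_command`, `ensure_group`) are hypothetical and not taken from the repo; only `os.getgroups()`, the `libvirt` group, and the `sg libvirt -c '…'` form come from that entry.

```python
import grp
import os
import shlex
import sys


def has_group(wanted_gid: int, process_gids: list[int]) -> bool:
    """True if the process's group set already contains wanted_gid."""
    return wanted_gid in process_gids


def sg_command(group: str, argv: list[str]) -> list[str]:
    """Build the `sg <group> -c '<command>'` argv that re-runs argv under group."""
    return ["sg", group, "-c", shlex.join(argv)]


def ensure_group(group: str = "libvirt") -> None:
    """Re-exec this script under `sg <group>` when the session's groups are stale."""
    try:
        wanted_gid = grp.getgrnam(group).gr_gid
    except KeyError:
        return  # the group does not exist on this host; nothing to heal
    # Include the effective gid: after `sg`, the group is the primary group,
    # so the re-exec'd process passes this check and does not loop.
    if has_group(wanted_gid, os.getgroups() + [os.getegid()]):
        return
    # os.execvp replaces the current process; it only returns by raising.
    os.execvp("sg", sg_command(group, [sys.executable, *sys.argv]))
```

Calling `ensure_group()` before any `qemu-img` invocation would make the re-exec invisible to the caller. Whether this lives in the script or in the Makefile target is still the open choice recorded in the entry.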
- [gotcha] `inet filter` default-deny blocks libvirt dnsmasq DHCP — silent, hard to diagnose (2026-06-19, task-3 integration gate): when `base__firewall_input_only: true` is applied to ubongo, the `table inet filter { chain input { policy drop; } }` blocks DHCP packets that arrive via the libvirt bridge (`virbr-boma`). In nftables, every base chain registered on a hook is evaluated independently, whichever table it lives in; an `accept` verdict in `table ip filter LIBVIRT_INP` ends only that chain and does NOT prevent `table inet filter` from seeing and dropping the same packet. VMs never got DHCP leases (dnsmasq socket confirmed by strace to never receive POLLIN despite tcpdump seeing the packet on `virbr-boma`). Diagnosed by temporarily changing `inet filter input` to `policy accept` → fd=3 immediately fired. Fix: `/etc/nftables.d/10-libvirt-boma.nft` drop-in adding `iifname "virbr-boma" accept` (survives service restarts via `include "/etc/nftables.d/*.nft"`). → The `base` role's template needs a `base__firewall_trusted_bridges` variable so this is encoded at the Ansible level, not in a manual host drop-in. Every host that runs Docker or libvirt and also has `base__firewall_input_only: true` needs an analogous exception.
- [gotcha] libvirt `leaseshelper` PID-file permission: `virPidFileReleasePath` unlinks `/run/leaseshelper.pid` after EVERY call; the `nobody` user cannot recreate it (2026-06-19, task-3 integration gate): dnsmasq runs as `nobody`; `libvirt_leaseshelper` is its `--dhcp-script`. The helper acquires a PID-file mutex at `/run/leaseshelper.pid`, but `virPidFileReleasePath` UNLINKS the file on exit. `/run/` is `root:root 755`, so `nobody` cannot create the file after the first unlink → every subsequent `add` call fails with `errno=13`, dnsmasq silently drops the DHCP grant (no log, no error to the client). Fix: suid root C wrapper at `/usr/lib/libvirt/libvirt_leaseshelper` (original moved to `.real`) that pre-creates `/run/leaseshelper.pid` owned by `nobody`, then drops privileges and execs the real helper. The root dnsmasq fork calls the wrapper; suid gives it permission to touch `/run/`; on return to the `nobody` uid the PID file stays. Also: `/var/lib/libvirt/dnsmasq/` must be `nobody:nogroup 775` so leaseshelper can update `virbr-boma.status`. This fix is host-local on ubongo and NOT in Ansible — encode it in an `integration_test` role task (or a libvirt role) before the harness can be safely re-deployed.
- [gotcha] cloud-init rejects underscores in `local-hostname` → silently skips network-config → VM never gets DHCP (2026-06-19, task-3 integration gate): setting `local-hostname: boma-it-askari_inputonly-<uuid>` caused cloud-init-local to consider the hostname invalid and skip writing the network-config to the system. Systemd-networkd then used the genericcloud default (no DHCP), so VMs got only IPv6 link-local. Fix in `scripts/integration-vm.py`: `name.replace("_", "-")` in the meta-data hostname (disk paths and virsh domain names keep the original underscore). Sanitization rule: RFC 952 hostnames allow hyphens, not underscores.
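The hostname fix in the last entry can be sketched as below. Only `name.replace("_", "-")` comes from the entry; the function name `cloud_init_hostname`, the example VM name suffix, and the extra label check (letters, digits, hyphens, no leading or trailing hyphen, at most 63 characters) are assumptions added for illustration. The check is not necessarily the validation cloud-init itself applies.

```python
import re

# Conservative single-label check in the spirit of RFC 952/1123 (an assumption,
# not cloud-init's own rule): alphanumerics and hyphens, no hyphen at either end.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def cloud_init_hostname(vm_name: str) -> str:
    """Return a hostname-safe variant of vm_name for meta-data's local-hostname."""
    hostname = vm_name.replace("_", "-")
    if not _LABEL.fullmatch(hostname):
        # Fail loudly here rather than let the VM boot without network-config.
        raise ValueError(f"not a valid hostname label: {hostname!r}")
    return hostname


# The virsh domain name and disk paths keep the original underscore.
domain_name = "boma-it-askari_inputonly-1234"  # hypothetical example name
local_hostname = cloud_init_hostname(domain_name)
```

Raising on an invalid result turns the silent "no DHCP" symptom into an immediate error in the driver, before any VM is created.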
Kaizen reviews — decisions ledger
Consumed signals and where their resolution now lives. Newest first.
2026-06-17
Second /kaizen run. 7 signals triaged; all 7 consumed (0 kept open). Two heavier items
(the rename-incomplete scan check and the Forgejo registry-login path) were built by
parallel subagents and verified against the diff. Bias-to-remove note: one PARK
(the ubongo self-management gap — out-of-phase, already tracked in STATUS) and zero
REMOVE; the rest accreted (migrate/change). None of the open signals were [unused]
tooling, so there was nothing to delete — the only reductive move available was parking
the out-of-phase build. Cadence: healthy — 3 days after the first run, every signal
0–2 days old except the one carried over from 2026-06-14; the "recurring ≥3" nudge in
scripts/friction-scan.py didn't fire this pass (all recurrence counts were 1), so the
thresholds need no change.
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| ADRs claim cross-doc reconciliation they didn't perform (06-14) | SYSTEMATIZE | New rename-incomplete check in scripts/repo-scan.py (+7 tests): when a numbered ADR announces a rename Old→New, flag any design-doc line where Old still appears in present tense (skips the announcing ADR, lines also naming New, and historical/negation cues; rejects ADR-NNN tokens as terms). 0 findings on the current tree — the Traefik→Caddy ripple edits have landed. Structural cousin of stale-deferred; run by /review-repo. (Was KEEP-OPEN on 2026-06-14 — now built.) |
| Image push to the Forgejo registry needs an interactive docker login (06-15) | SYSTEMATIZE → vault | Vault-backed login path so pushes are agent-completable: vault.forgejo.registry_token stub (CHANGEME, operator-minted) + scripts/registry-login.sh (reads the token, docker login --password-stdin, never echoes it) + make registry-login + a prereq note in docs/runbooks/claude-code-setup.md. Works once the operator fills the token via make edit-vault. |
| Single-file bind mount + atomic rewrite = stale config (06-16) | SYSTEMATIZE | → docs/testing/gotchas.md — "Single-file bind mount + atomic rewrite = stale config (reload-in-place only)": template writes a new inode, a single-file bind mount pins the old one, so an in-container reload reads stale config. Mount the config directory for reload-in-place roles; restart-based roles are fine with a single-file mount. |
| make check always fails on the first-ever deploy of a compose service role (06-16) | CHANGE | check_mode: false on the state: directory scaffold tasks in roles/reverse_proxy + roles/netbird_coordinator, so the base dirs exist under --check and the rest of the dry-run (templates + compose) evaluates instead of failing on a missing project_src. Inert under converge → Molecule unchanged. |
| Re-asked settled defaults — push + execution mode, in prose (06-17) | CHANGE (exec) + ACCEPTED (push) | Widened .claude/hooks/guard-execution-mode-menu.sh to also catch free-form prose re-asks of the subagent-vs-inline choice ("which execution approach?", "subagent vs inline", …), not just the literal menu; tested. The push re-ask stays a soft default via the dont-reask-settled-defaults memory — a genuine "should I push?" is sometimes legitimate, so it is deliberately not hard-blocked. |
| Docs-only commit tripped the rbw-locked pre-commit guard (06-17) | CHANGE | Root cause was NOT the ansible-lint files: scope (innocent) — it was .claude/hooks/guard-vault-preflight.sh blocking every locked git commit. Rewrote it to inspect the staged set (git diff --cached, plus -a/--all) and block only when Ansible content (^(roles|playbooks|inventories)/.*\.ya?ml$) is staged; docs-/config-only commits are now exempt. Fail-safe to block when unsure. Tested. |
| Agent can't self-manage ubongo (the control node it runs on) without operator grants (06-17) | PARK | The knowledge already lives in STATUS.md (control-node row: the interim claude-key + sjat NOPASSWD grants, and Pending: the proper ansible-user bootstrap) and the ubongo-self-sufficiency memory. Out-of-phase — the fix is the control-node bootstrap recipe, a tracked future build. Resurrection trigger: when building ubongo's base hardening / ansible-user bootstrap, fold in key-trusted NOPASSWD self-management so control-node self-management needs no ad-hoc operator grants. |
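The staged-set test from the `guard-vault-preflight.sh` row above, sketched in Python for illustration. The real hook is a shell script whose code is not shown here, so the function names are hypothetical. The sketch also omits the `-a`/`--all` handling the row mentions. The regex and the `git diff --cached` source are taken from the row.

```python
import re
import subprocess
from typing import Optional

# Pattern quoted in the ledger row: Ansible content that needs the vault to lint.
ANSIBLE_CONTENT = re.compile(r"^(roles|playbooks|inventories)/.*\.ya?ml$")


def needs_vault(paths: list[str]) -> bool:
    """True if any path in the staged set is Ansible content."""
    return any(ANSIBLE_CONTENT.match(path) for path in paths)


def staged_paths() -> list[str]:
    """List the files staged for the next commit."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


def should_block(vault_locked: bool, paths: Optional[list[str]] = None) -> bool:
    """Block a commit only when the vault is locked AND Ansible content is staged."""
    if not vault_locked:
        return False
    if paths is None:
        try:
            paths = staged_paths()
        except (OSError, subprocess.CalledProcessError):
            return True  # fail-safe: block when the staged set cannot be read
    return needs_vault(paths)
```

With this shape a docs-only commit passes while the vault is locked, and a commit touching `roles/…/*.yml` is still stopped until `rbw` is unlocked.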
2026-06-14
First /kaizen run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
a repo-scan.py check is its own build). Bias-to-remove note: zero PARK/REMOVE — none
of the open signals were [unused] tooling; they were all knowledge/gotchas/process,
which migrate or archive (knowledge is never deleted).
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (.claude/hooks/guard-execution-mode-menu.sh, wired in .claude/settings.json) is verified firing on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in superpowers:subagent-driven-development, used for the /kaizen build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + docs/decisions/adr-template.md settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → RESOLVED 2026-06-15 | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral closed — root cause was version skew (pre-Bearer libdns/gandi sent Gandi's deprecated Apikey header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + roles/reverse_proxy. |
| apply:{tags} not propagated by dynamic include_tasks (06-14) | SYSTEMATIZE | → docs/testing/gotchas.md — "Tags on dynamic include_tasks need apply:". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → docs/testing/gotchas.md — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → docs/testing/gotchas.md — "API / templating roles: render-only tests miss the real call". |
| item.values in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with item['key'], never item.key"). |
| TF child modules need their own required_providers (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own required_providers in versions.tf"). |
| ansible-lint var-naming rejects access__/backup__ cross-role names (06-14) | SYSTEMATIZE | → make new-role scaffolds a noqa reminder in defaults/main.yml; ADR-004's service-role section documents the convention; roles/reverse_proxy/defaults/main.yml is the reference. |
| Gandi rejects RFC-7505 null-MX 0 . (06-14) | MIGRATE | → roles/public_dns/README.md Notes (no MX + SPF -all + DMARC reject for a no-mail domain). |
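Why `item.values` yields the dict method (the `item.values` row above): in Jinja, dot access tries a Python attribute first and falls back to an item lookup, while subscript access does the reverse. The sketch below emulates that order in plain Python. It is not Jinja's own code, it returns `None` where Jinja would return an undefined object, and the `item` record is a made-up example.

```python
def dot_lookup(obj, name):
    """Emulate `obj.name` in a template: attribute first, then item."""
    try:
        return getattr(obj, name)
    except AttributeError:
        try:
            return obj[name]
        except (KeyError, IndexError, TypeError):
            return None


def subscript_lookup(obj, name):
    """Emulate `obj['name']` in a template: item first, then attribute."""
    try:
        return obj[name]
    except (KeyError, IndexError, TypeError):
        return getattr(obj, name, None)


# Hypothetical loop item with a key that collides with a dict method name.
item = {"name": "www", "values": ["203.0.113.10"]}

method = dot_lookup(item, "values")       # the bound dict.values method
data = subscript_lookup(item, "values")   # the list stored under the key
```

Keys that do not collide with a dict method (`name` here) resolve the same either way, which is why the bug only shows up for keys such as `values`, `keys`, or `items`.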
2026-06-10
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in .claude/settings.json blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a mechanical guard, not another note. |
| Every git commit needs rbw unlock — recurring (05-30) | CHANGE | Root cause was not the vault syntax-check (.ansible-lint already excludes vault.yml); it was ansible-lint auto-loading + decrypting inventories/production/group_vars/all/vault.yml via the wired vault_password_file. Scoped the pre-commit ansible-lint hook (always_run: false + files: ansible content) so docs-/config-only commits skip it and need no vault. Ansible-content commits still need rbw (intrinsic to linting vault-backed plays; accepted). |
| make test fails when run non-activated — ansible-config not found (06-06) | CHANGE | Makefile test/test-all now prepend $(CURDIR)/.venv/bin to PATH. |
| Molecule image missing from the Forgejo registry (06-06) | already built | make molecule-image-push target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | scripts/repo-scan.py open-deferred-item / stale-deferred checks, run by /review-repo. |
| make new-role brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft iif vs iifname, Molecule ansible_host, apply-path coverage blind spot, render-nft -c pattern (06-06) | MIGRATE | → docs/testing/gotchas.md (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, rbw sync stale cache, zsh word-split (05-30) | MIGRATE | → docs/runbooks/claude-code-setup.md "Environment gotchas". |
| finishing-a-development-branch offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
Process note: the 2026-06-10 review was manual (the /retro//kaizen tool wasn't
built). The 2026-06-14 block was the first run of /kaizen itself
(scripts/friction-scan.py Phase 0 + .claude/commands/kaizen.md); the dogfood both
cleared the backlog and validated the command.