FRICTION.md — kaizen friction log
Raw signals for the periodic kaizen review (/kaizen; see docs/TODO.md 11). This is
the input that keeps our tooling and conventions sharpening over time instead of only
accreting.
How to use: append freely during work under Open signals — don't curate,
don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
that isn't earning its keep. /kaizen reads this, then proposes a verdict per signal
(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
toward remove/park for unused tooling), migrates durable knowledge into the right docs,
and moves consumed signals into the decisions ledger below.
Entry format: date — [tag] observation — (optional) → systematization idea
Tags: [friction] recurring annoyance · [gotcha] surprising behaviour ·
[recurring] keeps coming back, should be systematized · [unused] tooling not
earning its keep.
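
The entry format above is regular enough to parse mechanically. A minimal sketch, assuming ISO dates (YYYY-MM-DD) and the four tags listed above; `Signal`, `parse_signal`, and `load_signals` are hypothetical names for illustration and are not taken from scripts/friction-scan.py:

```python
import re
from dataclasses import dataclass
from typing import Optional

DASH = "\u2014"   # the long dash the entry format uses as a separator
ARROW = "\u2192"  # the arrow that introduces the optional idea
TAGS = {"friction", "gotcha", "recurring", "unused"}

# date, separator, [tag], then the rest of the line
ENTRY = re.compile(
    rf"^(?P<date>\d{{4}}-\d{{2}}-\d{{2}})\s*{DASH}\s*"
    rf"\[(?P<tag>[a-z]+)\]\s*(?P<body>.+)$"
)


@dataclass
class Signal:
    date: str
    tag: str
    observation: str
    idea: Optional[str] = None


def parse_signal(line):
    """Return a Signal for a well-formed entry line, else None."""
    match = ENTRY.match(line.strip())
    if match is None or match["tag"] not in TAGS:
        return None
    observation, _, idea = match["body"].partition(ARROW)
    # Drop the separator that precedes the arrow, if there is one.
    observation = observation.rstrip().rstrip(DASH).rstrip()
    return Signal(match["date"], match["tag"], observation, idea.strip() or None)


def load_signals(lines):
    """Parse every entry line; an empty section yields an empty list."""
    return [s for s in map(parse_signal, lines) if s is not None]


example = (
    "2026-06-16 \u2014 [gotcha] config stays stale after reload "
    "\u2014 \u2192 mount the directory"
)
print(parse_signal(example))
```

Returning None for malformed lines keeps the loader tolerant of free-form notes, which fits the "append freely, don't curate" rule; a stricter loader could raise instead.
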
Open signals
(append new raw signals here; the next kaizen review consumes them)
Kaizen reviews — decisions ledger
Consumed signals and where their resolution now lives. Newest first.
2026-06-17
Second /kaizen run. 7 signals triaged; all 7 consumed (0 kept open). Two heavier items
(the rename-incomplete scan check and the Forgejo registry-login path) were built by
parallel subagents and verified against the diff. Bias-to-remove note: one PARK
(the ubongo self-management gap — out-of-phase, already tracked in STATUS) and zero
REMOVE; the rest accreted (migrate/change). None of the open signals were [unused]
tooling, so there was nothing to delete — the only reductive move available was parking
the out-of-phase build. Cadence: healthy — 3 days after the first run, every signal
0–2 days old except the one carried over from 2026-06-14; the "recurring ≥3" nudge in
scripts/friction-scan.py didn't fire this pass (all recurrence counts were 1), so the
thresholds need no change.
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| ADRs claim cross-doc reconciliation they didn't perform (06-14) | SYSTEMATIZE | New rename-incomplete check in scripts/repo-scan.py (+7 tests): when a numbered ADR announces a rename Old→New, flag any design-doc line where Old still appears in present tense (skips the announcing ADR, lines also naming New, and historical/negation cues; rejects ADR-NNN tokens as terms). 0 findings on the current tree — the Traefik→Caddy ripple edits have landed. Structural cousin of stale-deferred; run by /review-repo. (Was KEEP-OPEN on 2026-06-14 — now built.) |
| Image push to the Forgejo registry needs an interactive docker login (06-15) | SYSTEMATIZE → vault | Vault-backed login path so pushes are agent-completable: vault.forgejo.registry_token stub (CHANGEME, operator-minted) + scripts/registry-login.sh (reads the token, docker login --password-stdin, never echoes it) + make registry-login + a prereq note in docs/runbooks/claude-code-setup.md. Works once the operator fills the token via make edit-vault. |
| Single-file bind mount + atomic rewrite = stale config (06-16) | SYSTEMATIZE | → docs/testing/gotchas.md — "Single-file bind mount + atomic rewrite = stale config (reload-in-place only)": template writes a new inode, a single-file bind mount pins the old one, so an in-container reload reads stale config. Mount the config directory for reload-in-place roles; restart-based roles are fine with a single-file mount. |
| make check always fails on the first-ever deploy of a compose service role (06-16) | CHANGE | check_mode: false on the state: directory scaffold tasks in roles/reverse_proxy + roles/netbird_coordinator, so the base dirs exist under --check and the rest of the dry-run (templates + compose) evaluates instead of failing on a missing project_src. Inert under converge → Molecule unchanged. |
| Re-asked settled defaults — push + execution mode, in prose (06-17) | CHANGE (exec) + ACCEPTED (push) | Widened .claude/hooks/guard-execution-mode-menu.sh to also catch free-form prose re-asks of the subagent-vs-inline choice ("which execution approach?", "subagent vs inline", …), not just the literal menu; tested. The push re-ask stays a soft default via the dont-reask-settled-defaults memory — a genuine "should I push?" is sometimes legitimate, so it is deliberately not hard-blocked. |
| Docs-only commit tripped the rbw-locked pre-commit guard (06-17) | CHANGE | Root cause was NOT the ansible-lint files: scope (innocent) — it was .claude/hooks/guard-vault-preflight.sh blocking every locked git commit. Rewrote it to inspect the staged set (git diff --cached, plus -a/--all) and block only when Ansible content (^(roles|playbooks|inventories)/.*\.ya?ml$) is staged; docs-/config-only commits are now exempt. Fail-safe to block when unsure. Tested. |
| Agent can't self-manage ubongo (the control node it runs on) without operator grants (06-17) | PARK | The knowledge already lives in STATUS.md (control-node row: the interim claude-key + sjat NOPASSWD grants, and Pending: the proper ansible-user bootstrap) and the ubongo-self-sufficiency memory. Out-of-phase: the fix is the control-node bootstrap recipe, a tracked future build. Resurrection trigger: when building ubongo's base hardening / ansible-user bootstrap, fold in key-trusted NOPASSWD self-management so control-node self-management needs no ad-hoc operator grants. |
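
The single-file bind mount row in the table above can be reproduced without containers. The sketch below is an analogy rather than the actual role code: an open file handle stands in for the single-file bind mount (both stay attached to the original inode), and `atomic_rewrite` is a hypothetical helper that imitates a write-to-temp-then-rename update.

```python
import os
import tempfile


def atomic_rewrite(path, text):
    """Write text to a temp file in the same directory, then rename it over path."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp, path)  # path now names a different inode


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        config = os.path.join(workdir, "app.conf")
        with open(config, "w") as handle:
            handle.write("version=1\n")

        inode_before = os.stat(config).st_ino
        pinned = open(config)  # stands in for the single-file bind mount

        atomic_rewrite(config, "version=2\n")
        inode_after = os.stat(config).st_ino

        stale = pinned.read()  # still attached to the old inode
        pinned.close()
        with open(config) as handle:
            fresh = handle.read()  # a fresh lookup by name sees the new file

        return inode_before != inode_after, stale, fresh


inode_changed, stale, fresh = demo()
print(inode_changed, repr(stale), repr(fresh))
```

A directory mount behaves like the fresh lookup: the name is resolved again inside the mounted directory, so a reload-in-place picks up the new inode.
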
2026-06-14
First /kaizen run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
a repo-scan.py check is its own build). Bias-to-remove note: zero PARK/REMOVE — none
of the open signals were [unused] tooling; they were all knowledge/gotchas/process,
which migrate or archive (knowledge is never deleted).
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (.claude/hooks/guard-execution-mode-menu.sh, wired in .claude/settings.json) is verified firing on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in superpowers:subagent-driven-development, used for the /kaizen build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + docs/decisions/adr-template.md settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → RESOLVED 2026-06-15 | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral closed — root cause was version skew (pre-Bearer libdns/gandi sent Gandi's deprecated Apikey header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + roles/reverse_proxy. |
| apply:{tags} not propagated by dynamic include_tasks (06-14) | SYSTEMATIZE | → docs/testing/gotchas.md: "Tags on dynamic include_tasks need apply:". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → docs/testing/gotchas.md — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → docs/testing/gotchas.md — "API / templating roles: render-only tests miss the real call". |
| item.values in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with item['key'], never item.key"). |
| TF child modules need their own required_providers (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own required_providers in versions.tf"). |
| ansible-lint var-naming rejects access__/backup__ cross-role names (06-14) | SYSTEMATIZE | → make new-role scaffolds a noqa reminder in defaults/main.yml; ADR-004's service-role section documents the convention; roles/reverse_proxy/defaults/main.yml is the reference. |
| Gandi rejects RFC-7505 null-MX 0 . (06-14) | MIGRATE | → roles/public_dns/README.md Notes (no MX + SPF -all + DMARC reject for a no-mail domain). |
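
The item.values row in the table above comes down to lookup order. In Jinja2, dotted access tries a Python attribute before a dictionary key, so a key named values is shadowed by the dict.values method. The sketch below imitates that order in plain Python; `dot_lookup` and `subscript_lookup` are hypothetical helpers written for illustration, not Jinja2 or Ansible APIs.

```python
def dot_lookup(obj, name):
    """Imitate template syntax `obj.name`: attribute first, then item."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


def subscript_lookup(obj, name):
    """Imitate template syntax `obj['name']`: item first, then attribute."""
    try:
        return obj[name]
    except (KeyError, TypeError):
        return getattr(obj, name)


item = {"key": "web", "values": ["a", "b"]}

# Dotted access finds the built-in method before it ever looks at the key.
print(callable(dot_lookup(item, "values")))

# Subscript access returns the data stored under the key.
print(subscript_lookup(item, "values"))

# A key that does not collide with a dict attribute works either way,
# which is why the bug only shows up for names like values, keys, or items.
print(dot_lookup(item, "key"), subscript_lookup(item, "key"))
```
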
2026-06-10
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in .claude/settings.json blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a mechanical guard, not another note. |
| Every git commit needs rbw unlock, recurring (05-30) | CHANGE | Root cause was not the vault syntax-check (.ansible-lint already excludes vault.yml); it was ansible-lint auto-loading + decrypting inventories/production/group_vars/all/vault.yml via the wired vault_password_file. Scoped the pre-commit ansible-lint hook (always_run: false + files: ansible content) so docs-/config-only commits skip it and need no vault. Ansible-content commits still need rbw (intrinsic to linting vault-backed plays; accepted). |
| make test fails when run non-activated: ansible-config not found (06-06) | CHANGE | Makefile test/test-all now prepend $(CURDIR)/.venv/bin to PATH. |
| Molecule image missing from the Forgejo registry (06-06) | already built | make molecule-image-push target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | scripts/repo-scan.py open-deferred-item / stale-deferred checks, run by /review-repo. |
| make new-role brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft iif vs iifname, Molecule ansible_host, apply-path coverage blind spot, render-nft -c pattern (06-06) | MIGRATE | → docs/testing/gotchas.md (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, rbw sync stale cache, zsh word-split (05-30) | MIGRATE | → docs/runbooks/claude-code-setup.md "Environment gotchas". |
| finishing-a-development-branch offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
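
Both rbw-related fixes in this ledger (the scoped ansible-lint hook in this block and the later vault-preflight rewrite in the 2026-06-17 block) rest on one question: does the staged set contain Ansible content? A minimal sketch of that decision, reusing the path pattern quoted in the 2026-06-17 block. The real guard is a shell hook; `needs_vault` is a hypothetical name, and the list of staged paths is assumed to come from something like `git diff --cached --name-only`.

```python
import re

# Path pattern quoted in the 2026-06-17 ledger entry.
ANSIBLE_CONTENT = re.compile(r"^(roles|playbooks|inventories)/.*\.ya?ml$")


def needs_vault(staged_paths):
    """Return True when the commit should require an unlocked vault.

    staged_paths is a list of repo-relative paths, or None when the
    staged set could not be determined.
    """
    if staged_paths is None:
        return True  # fail safe: block when unsure
    return any(ANSIBLE_CONTENT.match(path) for path in staged_paths)


print(needs_vault(["FRICTION.md", "docs/testing/gotchas.md"]))
print(needs_vault(["roles/reverse_proxy/defaults/main.yml"]))
print(needs_vault(None))
```

Treating an unknown staged set as "needs vault" mirrors the fail-safe described in the ledger: a false block costs one unlock, while a false pass would skip the guard for a commit that does touch Ansible content.
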
Process note: the 2026-06-10 review was manual (the /retro//kaizen tool wasn't
built). The 2026-06-14 block was the first run of /kaizen itself
(scripts/friction-scan.py Phase 0 + .claude/commands/kaizen.md); the dogfood both
cleared the backlog and validated the command.