# FRICTION.md — kaizen friction log

Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is the input that keeps our tooling and conventions sharpening over time instead of only accreting.

**How to use:** append freely _during_ work under **Open signals** — don't curate, don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal (SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs, and moves consumed signals into the **decisions ledger** below.

**Entry format:** `date — [tag] observation — (optional) → systematization idea`

Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour · `[recurring]` keeps coming back, should be systematized · `[unused]` tooling not earning its keep.

---

## Open signals

_(append new raw signals here; the next kaizen review consumes them)_

- `[friction]` **Image push to the Forgejo registry fails with `no basic auth credentials`** (2026-06-15): `make caddy-image-push` (and `molecule-image-push`) fail unless the Docker daemon on ubongo has an interactive `docker login forgejo.nyumbani.baobab.band` session — and those creds are **not in vault** (only `gandi` + `hetzner` are), so an agent can't complete a push non-interactively. The build half is fully automatable; the push half silently requires a human. → candidate: document the `docker login` step in `docs/runbooks/claude-code-setup.md`, **or** store a scoped Forgejo registry token in vault + a `make registry-login` target (login via `--password-stdin`, `no_log`) so pushes are agent-completable like every other vault-backed action.
- `[recurring]` **ADRs claim cross-doc reconciliation they didn't actually perform** (2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said Traefik; the rename was left half-done across the doc set and the ADR over-claimed its own follow-through. Surfaced only by a full-repo `grep Traefik` during `/review-repo`. Same shape as the deferred-decision-goes-stale signal (a decision lands in one place, its promised ripple edits don't). → candidate `repo-scan.py` check: when an ADR's text asserts "X is updated to Y" / supersedes a named tool, flag remaining occurrences of the old name (or verify the claimed edit landed) — the structural cousin of `stale-deferred`. (KEEP-OPEN per the 2026-06-14 `/kaizen` run — it's its own build task.)

---

## Kaizen reviews — decisions ledger

Consumed signals and where their resolution now lives. Newest first.

### 2026-06-14

First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above — a `repo-scan.py` check is its own build).

**Bias-to-remove note:** zero PARK/REMOVE — none of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process, which migrate or archive (knowledge is never deleted).

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → **RESOLVED 2026-06-15** | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral **closed** — root cause was **version skew** (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |

### 2026-06-10

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a *mechanical* guard, not another note. |
| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was **not** the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so **docs-/config-only commits skip it and need no vault**. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |

**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't built). The 2026-06-14 block was the **first run of `/kaizen`** itself (`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both cleared the backlog and validated the command.
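
**Sketch for the kept-open signal:** the candidate `repo-scan.py` check from the ADR-reconciliation signal (an ADR asserts "X is updated to Y", so flag remaining occurrences of X) might look roughly like the following. This is a minimal illustration, not the existing `repo-scan.py` code: the function names, the claim regex, and the `docs` mapping are all hypothetical, and the regex only covers the one phrasing quoted from ADR-024.

```python
import re

# Hypothetical pattern: only matches claims phrased like
# "prose that mentioned Traefik is updated to read Caddy".
CLAIM_RE = re.compile(
    r"mentioned\s+(?P<old>[A-Z][\w-]*)\s+is\s+updated\s+to\s+read\s+(?P<new>[A-Z][\w-]*)"
)


def find_rename_claims(adr_text):
    """Return (old, new) name pairs that the ADR text claims were reconciled."""
    return [(m["old"], m["new"]) for m in CLAIM_RE.finditer(adr_text)]


def find_stale_mentions(old_name, docs, skip=()):
    """List (doc_name, line_number, line) for each remaining mention of old_name.

    `docs` maps a document name to its full text (an assumption for this
    sketch; a real check would read files from the repo). `skip` names
    documents to ignore, e.g. the claiming ADR, which mentions the old
    name as part of the claim itself.
    """
    word = re.compile(rf"\b{re.escape(old_name)}\b")
    hits = []
    for name, text in sorted(docs.items()):
        if name in skip:
            continue
        for number, line in enumerate(text.splitlines(), start=1):
            if word.search(line):
                hits.append((name, number, line.strip()))
    return hits


if __name__ == "__main__":
    # Made-up document contents, for illustration only.
    adr = "ADR-017 prose that mentioned Traefik is updated to read Caddy."
    docs = {
        "ADR-024.md": adr,
        "ADR-017.md": "Routing goes through Traefik.\nTLS is handled by Caddy.",
    }
    for old, new in find_rename_claims(adr):
        for name, number, line in find_stale_mentions(old, docs, skip={"ADR-024.md"}):
            print(f"{name}:{number}: still says {old!r} (claimed renamed to {new!r}): {line}")
```

Skipping the claiming ADR is a design choice for the sketch: the claim sentence necessarily contains the old name, so scanning it would always produce a false positive.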