# FRICTION.md — kaizen friction log Raw signals for the periodic **kaizen review** (the methodology retrospective; see `docs/TODO.md`). This is the input that keeps our tooling and conventions sharpening over time instead of only accreting. **How to use:** append freely _during_ work — don't curate, don't fix here. Capture friction, surprises, fixes that keep recurring, and tooling that isn't earning its keep. The kaizen review reads this, then proposes **add / change / remove** (biased toward _remove_) and records the decisions as ADRs. **Entry format:** `date — [tag] observation — (optional) → systematization idea` Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour · `[recurring]` keeps coming back, should be systematized · `[unused]` tooling not earning its keep. --- ## 2026-05-30 — initial seed (from the Claude-Code setup session) - `[recurring]` Every `git commit` needs `rbw` unlocked (the pre-commit ansible-lint hook decrypts `vault.yml` for its syntax-check). Mitigated with a 5h lock timeout and an `rbw unlocked` pre-flight convention. → _Open:_ could ansible-lint skip vault decryption for syntax-check, so committing doesn't need the vault at all? - `[gotcha]` pre-commit stashes _unstaged_ changes before running hooks, so a partial commit reverted an interdependent file (`ansible.cfg`) and failed. → Commit interdependent changes together, or stage the config change first. - `[gotcha]` `make new-role` had never worked on this host: `mkdir {a,b,c}` brace expansion fails under `/bin/sh` (dash). Fixed with explicit paths. → A real run catches what static review can't; consider smoke-testing scaffold commands. - `[gotcha]` `rbw sync` is required after adding a Vaultwarden item before `rbw get` finds it (stale local cache). - `[gotcha]` This shell is zsh — unquoted `$VAR` does not word-split, so a variable holding a file list was passed as a single argument. → Use explicit args/arrays. - `[friction]` Long sessions: I make a batch of edits but can't commit until you `rbw unlock`. The 5h timeout + pre-flight check address the symptom; watch whether it still bites. - `[gotcha]` Hooks (or any new `.claude/settings.json`) added mid-session don't activate until a Claude Code **restart** — the settings watcher only tracks settings files that existed at session start. Opening `/hooks` and dismissing did _not_ load them. → Fresh sessions load them normally; restart after adding hooks. ## 2026-05-31 - I asked to draft an ADR and got: No formal status-header convention, but since this is a draft for discussion I'll mark it Proposed so it isn't mistaken for an accepted decision. Here's the draft. ## 2026-06-01 - `[friction]` The `finishing-a-development-branch` flow (and generic AI/dev tooling) offers "push and open a Pull Request," but our Forgejo `origin` is trunk-based with no merge-request / approval gate (CLAUDE.md git conventions). That option doesn't apply — the real path is local fast-forward merge to `main`, then push. → Skills and conventions that assume a GitHub-style PR workflow need a homelab-aware variant; encode that here "finishing a branch" means merge-locally-then-push, not open-a-PR. ## 2026-06-05 - `[recurring]` The `writing-plans` skill ends by asking "subagent-driven vs inline execution?" — always answer subagent-driven here. Don't ask; default straight to subagent-driven (fresh subagent per task + review between tasks). → Standing preference; skip the execution-mode prompt. - `[recurring]` When a **deferred** decision later resolves, docs that referenced the deferral go stale and a plan's file-map can miss them (e.g. resolving the mesh-VPN choice left `new-host.md` still saying "mesh VPN (choice deferred)"; the ubongo work similarly left a contradiction in CLAUDE.md). A _broadened_ final grep sweep caught both. → On resolving a deferred decision, grep all canonical docs for the deferral language ("choice deferred", "pending", "TBD", the placeholder's name) and reconcile every hit — don't rely on the plan's file-map alone. Worth a `/review-repo` check for lingering "deferred/pending/TBD" references whose ADR has since resolved. - **Recurred a 3rd time (same day):** ADR-017 resolved the browser-E2E harness but left ADR-015's own "Deferred" list item #2 still reading as open — not caught by the ADR-017 plan's sweep (which only checked for _its own_ placeholder language), only by a later STATUS pass. Lesson sharpened: the stale reference often lives in the **originating ADR's Deferred section**, which the resolving ADR's plan won't think to grep. → When an ADR resolves another ADR's deferred item, edit that **source ADR's Deferred list** in the same change. Three hits now — promote from "worth a check" to **build it**: a `/review-repo` rule flagging any ADR "Deferred/Open" entry whose subject is named as RESOLVED/DECIDED elsewhere. ## 2026-06-06 - `[recurring]` **Asked the execution-mode question AGAIN** ("subagent-driven vs inline — which approach?") at the end of `writing-plans`, despite the 2026-06-05 standing preference _and_ the `always-subagent-driven-execution` memory both saying don't ask. Root cause: the `writing-plans` skill's "Execution Handoff" step scripts the menu, and I followed the skill text over the user's standing override. Second occurrence → escalate from "skip the prompt" to a **hard rule**: never present the execution-mode menu; finishing a plan means defaulting straight to subagent-driven. - `[friction]` **Don't pause for approval between writing a plan and implementing it.** The user has standing pre-approval to carry straight through plan → implementation. The brainstorming/plan flow already has explicit approval gates (design approval, spec review); adding another "shall I proceed to implement?" gate after the plan is written is redundant friction. → After `writing-plans` finishes, begin subagent-driven implementation directly. The only reason to stop is a genuine blocker or ambiguity, not a routine checkpoint. ### Host nftables firewall build (`base` role) - `[gotcha]` **`nft -c` rejects `iif ""` when the interface is absent** (it resolves to an interface _index_ at load time). The render+syntax-check Molecule step caught `iif "wt0"` failing in the container — and it would fail identically on any real host before NetBird brings up `wt0`. Use **`iifname ""`** (string match, no existence requirement, survives the interface coming/going) for any interface that may be absent. - `[gotcha]` **Molecule's `community.docker` connection uses `ansible_host` as the container name** (`remote_addr`). Setting `ansible_host` as _data_ in a scenario's `host_vars` (e.g. to give a resolver a fake IP) breaks the connection → `UNREACHABLE`, "Failed to create temporary directory". Don't override `ansible_host` in molecule; feed fixture IPs another way (or keep fixtures to zone sources and unit-test IP resolution). - `[recurring]` **`make test ROLE=` needs the venv on PATH.** Run non-activated (as agents do), molecule dies with `FileNotFoundError: 'ansible-config'` — it shells out to `ansible-config`/`ansible-playbook` by bare name. Workaround: `PATH="$PWD/.venv/bin:$PATH" .venv/bin/molecule test`. Also the molecule image wasn't in the Forgejo registry (pull → "not found"); had to `make molecule-image` to build it locally. → Consider (a) the Makefile `test` target prepending `.venv/bin` to PATH, and (b) `make molecule-image-push` so a fresh checkout can pull it. - `[gotcha]` **Apply-only task paths have no Level-1 coverage**, so safety bugs hide there. The `nft` auto-rollback snapshot used a bare `nft list ruleset` (no leading `flush ruleset`) → the revert was a silent no-op on first apply and errored on later ones; the whole safety net was dead. Molecule never runs the apply (gated off), so only adversarial review + an isolated-netns round-trip test caught it. → For apply/safety paths molecule can't exercise, validate out-of-band (a throwaway `--privileged` container with its own netns) and treat a final adversarial review as mandatory, not optional. - `[note]` The render-and-`nft -c` (no-apply) Molecule approach **earned its keep** — caught the `iif`/`iifname` bug deterministically without touching the host kernel. Good pattern to reuse for other config-rendering roles. ## 2026-06-09 - `[recurring]` **Asked the execution-mode question AGAIN** — presented the "subagent-driven vs inline" menu at the `writing-plans` → execution handoff, even though the standing 2026-06-05 preference and the `always-subagent-driven-execution` memory both say to default to subagent-driven without asking. Third occurrence; the earlier "hard rule" escalation didn't hold because both `writing-plans` and `subagent-driven-development` script the menu and I followed the skill text over the user's standing override. → The standing preference outranks skill scripts: when a skill's handoff offers the execution-mode menu, skip it and proceed subagent-driven; only ask if the user signals otherwise this session.