FRICTION.md — kaizen friction log
Raw signals for the periodic kaizen review (the methodology retrospective; see
docs/TODO.md). This is the input that keeps our tooling and conventions sharpening
over time instead of only accreting.
How to use: append freely during work under Open signals — don't curate, don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling that isn't earning its keep. The kaizen review reads this, then proposes add / change / remove (biased toward remove), migrates durable knowledge into the right docs, and moves consumed signals into the decisions ledger below.
Entry format: date — [tag] observation — (optional) → systematization idea
Tags: [friction] recurring annoyance · [gotcha] surprising behaviour ·
[recurring] keeps coming back, should be systematized · [unused] tooling not
earning its keep.
Open signals
(append new raw signals here; the next kaizen review consumes them)
- [gotcha] Hetzner IPs are 403'd by Google's Go module infra; caddy-dns/gandi DNS-01 didn't issue (2026-06-14, M4a): building the custom Caddy image on askari failed because `proxy.golang.org` and `golang.org` both return 403 Forbidden to the Hetzner IP (worked on ubongo). Reworked the role to build on the control node + `docker save/load` to the target. Then the `caddy-dns/gandi` DNS-01 plugin would not create the `_acme-challenge` TXT despite a token verified to (a) be in Caddy's env and (b) create TXT records via the Gandi API directly. There was no plugin error, just "propagation timeout, last error "; resolvers/timeout tuning didn't help. Resolution: askari is a public host, so switched it to HTTP-01 + vanilla Caddy (works, drops the custom image entirely). DNS-01 deferred to Phase 2 (cluster's mesh/LAN-only services); the plugin + the Hetzner-build-block to be solved then. → lesson: prefer HTTP-01 wherever a host is publicly reachable; reserve DNS-01 (and its plugin/build complexity) for hosts that genuinely can't do HTTP-01. Both bugs surfaced only on the live host.
- [gotcha] A tag on `include_tasks` does NOT reach the included tasks; need `apply: {tags:}` (2026-06-14): M3's `base/tasks/main.yml` tagged the ssh/fail2ban `include_tasks` with `hardening`, but `make deploy … TAGS=hardening` ran nothing (ok=3 changed=0): a tag on a dynamic include selects the include, not its contents. Fix: `include_tasks: {file: x.yml, apply: {tags: [hardening]}}`, keeping the tag on the include itself so it is still selected. The same latent bug sat in the firewall include (never hit, since firewall was only ever run untagged). Also the check-mode artifact: a `service`/handler for a not-yet-installed package fails in a first-run `--check` → guard with `when: not ansible_check_mode`. Both caught only by the live `make check/deploy` on askari; Molecule converges untagged, so it can't catch tag-propagation. 3rd reinforcement (after M1 `item.values`, M2 TF `required_providers`) that live execution catches what review + container tests miss. → when a role uses tags to apply concern-subsets, `apply:` is mandatory on its includes; consider an ansible-lint/CI check that `make deploy … TAGS=<concern>` actually changes things.
- [gotcha] Terraform child modules need their own `required_providers` for non-hashicorp providers (2026-06-14): `terraform init` for the `offsite` env failed because the `hetzner_vm` module used `hcloud_*` resources with no `required_providers` block, so TF inferred `hashicorp/hcloud` (nonexistent). The `proxmox_vm` module had the identical latent bug, never caught because Proxmox TF was never `init`ed. Both the terraform-MCP schema check and the final review subagent missed it; only `make tf-init/plan` on ubongo caught it. Reinforces the M1 signal that live/real execution catches what static review can't, now for Terraform. → always give a TF module its own `versions.tf` with `required_providers`; treat "reviewed but never run" as a structural blind spot.
- [gotcha] `item.values` in a loop sends the dict's `.values()` METHOD, not the key (2026-06-14): the `public_dns` role looped over records that have a `values:` key and used `{{ item.values }}` in the `gandi_livedns` task. Jinja attribute access resolved `item.values` to the built-in dict method, so Gandi received `"<built-in method values of dict object at 0x...>"` as the live TXT value: corrupt and non-idempotent (the address changes each run → always "changed"). The fix is bracket-indexing: `item['values']` (same risk for any key named `keys`/`items`/`get`/`update`/...). → convention: in loops, index loop-var keys with `item['key']`, never `item.key`; consider an ansible-lint guard.
- [gotcha] Gandi LiveDNS rejects RFC-7505 null-MX `0 .` (2026-06-14): "invalid format for MX record." Used "no MX + no apex A" + SPF `-all` + DMARC reject instead. Minor, but worth a note for any future no-mail domain on Gandi.
- [recurring] `apply=false` Molecule + data-only pytest leave a real gap for API/templating roles (2026-06-14): both the null-MX and the `item.values` bugs sailed through the spec, BOTH review subagents, the pytest (validates the data file, not the rendered template), and the Molecule scenario (`apply=false`, so the API tasks never run); only the live `make check/deploy` against the real Gandi API surfaced them. For roles whose payload is "render data → external API call", the rendered template is the thing that breaks, and nothing short of a real (or check-mode) API call exercises it. → for such roles, treat a check-mode run against the real API as a required gate, not an optional final step; or build a render-only assertion that materializes the module args.
- [recurring] Execution-mode menu asked AGAIN despite the 2026-06-10 "mechanical fix" (2026-06-14): at the M1 (public_dns) plan handoff I presented the "1. Subagent-Driven / 2. Inline Execution: which approach?" menu and asked the user to pick. The decisions ledger (2026-06-10) records this exact behaviour as CHANGE → mechanical: "Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven." It did not fire: either the hook is absent in this clone, its matcher doesn't match the wording the `writing-plans` skill actually produces, or it isn't installed/active. The standing agreement is to default straight to subagent-driven without asking. → verify the Stop hook exists and that its pattern matches the real menu text (the skill scripts "Two execution options" / "Which approach?"); if it relies on `.claude/settings.json` hooks that aren't active here, that's the gap. 5th occurrence (06-05/06/09/10/14).
- [friction] ADR-writing policy is unsettled (2026-05-31): drafting an ADR, I invented a Status header ("Proposed") on the fly because there's no documented convention for how we write ADRs (status lifecycle, required sections). → TODO 10.2: decide a minimal ADR template / status convention.
- [recurring] Brainstorming's "user reviews spec" gate fires despite a standing agreement to skip it (2026-06-10): writing the ADR-structure spec, I stopped to ask the user to review the finished spec before writing the plan, because the `superpowers:brainstorming` skill scripts that gate. We had previously agreed I should move directly from the Q/A to the implementation plan once the spec is written. Same shape as the execution-mode-menu signal: an external skill's script conflicting with a boma convention, where prose reminders don't hold. → consider a mechanical guard (Stop-hook family) or a CLAUDE.md/skill-override note that suppresses the spec-review gate.
- [recurring] Subagent faithfulness self-reports can be wrong; controller must diff (2026-06-10): during the ADR-023 retroactive restructure, an implementer subagent reported "0 substantive deletions, the See-also lines reappear verbatim" for ADR-014, but it had actually dropped the cross-reference lines. Caught only by the controller independently running `git show <sha> | grep '^-[^-]'`. For faithfulness-critical edits delegated to subagents, the agent's own audit is not sufficient evidence. → systematize a controller-side deletion-audit step (every `-` line must be a classified, expected change) before accepting any "presentational-only" restructure; consider a helper script.
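
The ansible-lint/CI idea from the `include_tasks` signal could begin as a static check over parsed task lists. The sketch below assumes the tasks are already loaded from YAML into Python dicts and that `apply` sits under the include's arguments; `includes_missing_apply_tags` is a hypothetical helper, not an existing ansible-lint rule, and it does not replace the live tagged deploy run.

```python
INCLUDE_KEYS = ("include_tasks", "ansible.builtin.include_tasks")


def includes_missing_apply_tags(tasks):
    """Return names of tagged dynamic includes that do not pass tags via apply."""
    flagged = []
    for task in tasks:
        key = next((k for k in INCLUDE_KEYS if k in task), None)
        if key is None or not task.get("tags"):
            continue  # not a dynamic include, or not tagged at all
        args = task[key]
        # The short form (`include_tasks: x.yml`) has no place for `apply`.
        apply = (args.get("apply") or {}) if isinstance(args, dict) else {}
        if not apply.get("tags"):
            flagged.append(task.get("name", "<unnamed>"))
    return flagged


# Made-up tasks, shaped like the parsed YAML of a role's tasks/main.yml.
tasks = [
    {"name": "ssh", "include_tasks": "ssh.yml", "tags": ["hardening"]},
    {
        "name": "fail2ban",
        "include_tasks": {"file": "fail2ban.yml", "apply": {"tags": ["hardening"]}},
        "tags": ["hardening"],
    },
]
print(includes_missing_apply_tags(tasks))
```

Only the `ssh` include is flagged here: it carries a tag but never forwards it to the tasks it includes.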
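
The `item.values` signal comes down to lookup order. As a rough illustration in plain Python (not Jinja itself): dotted access prefers the Python attribute and only then falls back to the key, while bracket access prefers the key. The helpers `dotted_lookup` and `bracket_lookup` are hypothetical and only mimic that order; the record contents are made up.

```python
def dotted_lookup(obj, name):
    """Mimic `item.name`: attribute first, then key."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


def bracket_lookup(obj, name):
    """Mimic `item['name']`: key first, then attribute."""
    try:
        return obj[name]
    except (KeyError, TypeError):
        return getattr(obj, name)


# Made-up record with a key that collides with a dict method name.
record = {"name": "_dmarc", "type": "TXT", "values": ["v=DMARC1; p=reject"]}

# `values` collides with dict.values, so dotted access returns the bound method.
print(callable(dotted_lookup(record, "values")))
# `name` is not a dict attribute, so dotted access falls through to the key.
print(dotted_lookup(record, "name"))
# Bracket access returns the stored data even for colliding names.
print(bracket_lookup(record, "values"))
```

The same collision applies to any key that shares a name with a dict method, which is why the bracket form is the safer default for loop variables.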
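
The controller-side deletion audit suggested in the last signal could start from a helper like the sketch below. It assumes the diff text has already been captured (for example from `git show <sha>`) and reproduces the `grep '^-[^-]'` filter from that signal; `removed_lines`, `unexpected_deletions`, the `expected` set, and the sample diff are hypothetical, not existing tooling.

```python
import re

# Same filter as grep '^-[^-]': a '-' followed by a non-dash character.
# Like the grep, this skips '---' file headers, and it also skips removed
# lines whose own text starts with '-' (e.g. list bullets) and removed
# empty lines.
REMOVED = re.compile(r"^-[^-]")


def removed_lines(diff_text):
    """Return the text of removed lines in a unified diff, without the '-' marker."""
    return [line[1:] for line in diff_text.splitlines() if REMOVED.match(line)]


def unexpected_deletions(diff_text, expected):
    """Return removed lines that were not classified as expected changes."""
    return [line for line in removed_lines(diff_text) if line.strip() not in expected]


# Made-up sample diff; the path and contents are illustrative only.
sample_diff = """\
--- a/docs/adr/014.md
+++ b/docs/adr/014.md
@@ -1,3 +1,2 @@
-Status: Accepted
-See also: ADR-008
+**Status:** Accepted
 Context line
"""

expected = {"Status: Accepted"}  # deletions the restructure is allowed to make
print(unexpected_deletions(sample_diff, expected))
```

Anything the helper returns is a deletion nobody classified, so a non-empty result would be a reason to reject a "presentational-only" restructure until it is explained. Because the filter inherits the grep's blind spots, removed bullet lines would need a stricter pattern.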
Kaizen reviews — decisions ledger
Consumed signals and where their resolution now lives. Newest first.
2026-06-10
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in .claude/settings.json blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a mechanical guard, not another note. |
| Every git commit needs rbw unlock, recurring (05-30) | CHANGE | Root cause was not the vault syntax-check (.ansible-lint already excludes vault.yml); it was ansible-lint auto-loading + decrypting inventories/production/group_vars/all/vault.yml via the wired vault_password_file. Scoped the pre-commit ansible-lint hook (always_run: false + files: ansible content) so docs-/config-only commits skip it and need no vault. Ansible-content commits still need rbw (intrinsic to linting vault-backed plays; accepted). |
| make test fails when run non-activated, ansible-config not found (06-06) | CHANGE | Makefile test/test-all now prepend $(CURDIR)/.venv/bin to PATH. |
| Molecule image missing from the Forgejo registry (06-06) | already built | make molecule-image-push target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | scripts/repo-scan.py open-deferred-item / stale-deferred checks, run by /review-repo. |
| make new-role brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft iif vs iifname, Molecule ansible_host, apply-path coverage blind spot, render-nft -c pattern (06-06) | MIGRATE | → docs/testing/gotchas.md (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, rbw sync stale cache, zsh word-split (05-30) | MIGRATE | → docs/runbooks/claude-code-setup.md "Environment gotchas". |
| finishing-a-development-branch offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
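
A matcher in the Stop-hook family from the table above is only useful if it is checked against the wording the skill really emits. A minimal sketch of such a check, assuming the two phrases quoted in the open signals ("Two execution options", "Which approach?") are what the matcher should catch; `MENU_PATTERN` and `menu_appears` are hypothetical names, the sample messages are made up, and the real hook in .claude/settings.json may be wired differently.

```python
import re

# Hypothetical matcher: case-insensitive, tolerant of extra whitespace.
MENU_PATTERN = re.compile(
    r"two\s+execution\s+options|which\s+approach\s*\?",
    re.IGNORECASE,
)


def menu_appears(message):
    """Return True if the message contains the execution-mode menu wording."""
    return MENU_PATTERN.search(message) is not None


# Made-up sample messages for a quick self-check.
samples = {
    "Two execution options:\n1. Subagent-Driven\n2. Inline Execution\nWhich approach?": True,
    "Plan written. Proceeding subagent-driven.": False,
}

for text, expected in samples.items():
    assert menu_appears(text) == expected
print("matcher agrees with all samples")
```

Pasting the actual menu text from a failing session into `samples` would turn this into a regression check for the matcher.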
Process note: the /retro tool (TODO 11) still isn't built, so this review was
manual. Curating by hand (migrate durable knowledge → docs, archive consumed signals →
this ledger) worked well; fold that curation step into /retro when it's built.