sjat 13ae674cc9 chore(kaizen): first /kaizen run — curate 12 friction signals
Dogfood of the new /kaizen command. 11 consumed, 1 kept open.
- SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule
  tag-isolation testing, API/templating render-only gap); CLAUDE.md
  (item['key'] loop convention, TF module required_providers); public_dns
  README (Gandi null-MX workaround).
- CHANGE → extend the Stop hook to also guard the brainstorming spec-review gate
  (verified: blocks the gate, passes meta-discussion).
- SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder;
  ADR-004 documents the cross-role-naming convention.
- ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024;
  subagent-faithfulness now embodied in the two-stage subagent review.
- KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation.

Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:46:23 +02:00


FRICTION.md — kaizen friction log

Raw signals for the periodic kaizen review (/kaizen; see docs/TODO.md 11). This is the input that keeps our tooling and conventions sharpening over time instead of only accreting.

How to use: append freely during work under Open signals — don't curate, don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling that isn't earning its keep. /kaizen reads this, then proposes a verdict per signal (SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased toward remove/park for unused tooling), migrates durable knowledge into the right docs, and moves consumed signals into the decisions ledger below.

Entry format: date — [tag] observation — (optional) → systematization idea

Tags: [friction] recurring annoyance · [gotcha] surprising behaviour · [recurring] keeps coming back, should be systematized · [unused] tooling not earning its keep.
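The entry format above is regular enough to parse mechanically, and a Phase 0 scan has to do roughly that before /kaizen can triage anything. The sketch below is a hypothetical parser, not the real scripts/friction-scan.py. `Signal` and `parse_signal` are illustrative names. It assumes the tag list is closed and that the first "→" always introduces the systematization idea. If the date is written after the tag instead, it simply stays in the observation text.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Closed tag set from the entry format (assumption: no other tags are valid).
TAGS = {"friction", "gotcha", "recurring", "unused"}
TAG_RE = re.compile(r"\[(\w+)\]")


@dataclass
class Signal:
    date: Optional[str]
    tag: str
    observation: str
    idea: Optional[str]


def parse_signal(line: str) -> Optional[Signal]:
    """Parse 'date — [tag] observation — → idea'; return None for non-signal lines."""
    # Drop surrounding whitespace and any rendered bullet marker.
    text = line.strip().lstrip("•-* ").strip()
    match = TAG_RE.search(text)
    if match is None or match.group(1) not in TAGS:
        return None
    # Anything before the tag is taken as the date (may be empty).
    date = text[: match.start()].strip(" —") or None
    body = text[match.end():].strip()
    # The first "→" separates the observation from the optional idea.
    observation, arrow, idea = body.partition("→")
    return Signal(
        date=date,
        tag=match.group(1),
        observation=observation.strip(" —"),
        idea=idea.strip() if arrow else None,
    )
```

Counting the open backlog for a nudge then becomes `sum(parse_signal(l) is not None for l in lines)` over the lines of the Open signals section.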


Open signals

(append new raw signals here; the next kaizen review consumes them)

  • [recurring] ADRs claim cross-doc reconciliation they didn't actually perform (2026-06-14): ADR-024's Status and Consequences sections asserted "ADR-017 prose that mentioned Traefik is updated to read Caddy", but ADR-008/017/019 and CAPABILITIES still said Traefik. The rename was left half-done across the doc set, and the ADR over-claimed its own follow-through. It surfaced only through a full-repo grep for Traefik during /review-repo. Same shape as the deferred-decision-goes-stale signal: a decision lands in one place, but its promised ripple edits don't. → candidate repo-scan.py check: when an ADR's text asserts "X is updated to Y" or supersedes a named tool, flag remaining occurrences of the old name (or verify that the claimed edit landed). This is the structural cousin of stale-deferred. (KEEP-OPEN per the 2026-06-14 /kaizen run: it's its own build task.)
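The candidate check could start very small. What follows is a hypothetical sketch that assumes claims are phrased like ADR-024's ("X is updated to [read] Y"). `CLAIM_RE`, `find_overclaims` and `load_docs` are illustrative names, not existing repo-scan.py code. It flags every other document that still names the tool an ADR says was renamed away.

```python
import re
from pathlib import Path

# Hypothetical claim pattern, modelled on ADR-024's wording
# ("... mentioned Traefik is updated to read Caddy").
CLAIM_RE = re.compile(r"(?P<old>\w[\w-]*) is updated to (?:read )?(?P<new>\w[\w-]*)")


def find_overclaims(docs: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (claiming_doc, old_name, offending_doc) triples.

    `docs` maps a path to its text. A finding means some other doc still
    names the tool that an ADR claims was renamed away.
    """
    findings = []
    for claimer, text in docs.items():
        for match in CLAIM_RE.finditer(text):
            old = match.group("old")
            old_word = re.compile(rf"\b{re.escape(old)}\b")
            for path, other in docs.items():
                # The claiming ADR itself necessarily mentions the old name.
                if path != claimer and old_word.search(other):
                    findings.append((claimer, old, path))
    return findings


def load_docs(root: str) -> dict[str, str]:
    """Read every Markdown file under `root` (hypothetical docs layout)."""
    return {str(p): p.read_text(encoding="utf-8") for p in sorted(Path(root).rglob("*.md"))}
```

Assuming the ADRs live under a `docs` tree, a run would look like `find_overclaims(load_docs("docs"))`. A real check would probably need an allowlist, because superseded ADRs legitimately keep the old name as history.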

Kaizen reviews — decisions ledger

Consumed signals and where their resolution now lives. Newest first.

2026-06-14

First /kaizen run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above; a repo-scan.py check is its own build). Bias-to-remove note: zero PARK/REMOVE, because none of the open signals were [unused] tooling. They were all knowledge, gotchas or process, which get migrated or archived (knowledge is never deleted).

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked AGAIN, 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (.claude/hooks/guard-execution-mode-menu.sh, wired in .claude/settings.json) is verified firing on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation, an independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff, is now embodied in superpowers:subagent-driven-development, used for the /kaizen build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + docs/decisions/adr-template.md settle status/sections; both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT | ADR-024's revised Status records the HTTP-01 decision, the DNS-01 deferral to Phase 2, and the Hetzner-build + plugin blocks. |
| apply:{tags} not propagated by dynamic include_tasks (06-14) | SYSTEMATIZE | docs/testing/gotchas.md ("Tags on dynamic include_tasks need apply:") |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | docs/testing/gotchas.md ("Testing concern-tag isolation in Molecule") |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | docs/testing/gotchas.md ("API / templating roles: render-only tests miss the real call") |
| item.values in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | CLAUDE.md Ansible conventions ("index loop-var keys with item['key'], never item.key") |
| TF child modules need their own required_providers (06-14) | SYSTEMATIZE | CLAUDE.md Terraform conventions ("every module declares its own required_providers in versions.tf") |
| ansible-lint var-naming rejects access__/backup__ cross-role names (06-14) | SYSTEMATIZE | make new-role scaffolds a noqa reminder in defaults/main.yml; ADR-004's service-role section documents the convention; roles/reverse_proxy/defaults/main.yml is the reference. |
| Gandi rejects RFC-7505 null-MX "0 ." (06-14) | MIGRATE | roles/public_dns/README.md Notes (no MX + SPF -all + DMARC reject for a no-mail domain) |
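The item.values trap follows from Jinja2's documented lookup order. `foo.bar` tries the attribute first and only then the item, while `foo['bar']` tries the item first. On a dict, `values` is a method, so the dot form returns the bound method instead of the key's value. The sketch below models that order in plain Python so the behaviour can be seen without Ansible. `dot_lookup`, `subscript_lookup` and `UNDEFINED` are illustrative stand-ins, not Jinja2's own code.

```python
from typing import Any

UNDEFINED = object()  # stand-in for Jinja2's Undefined


def dot_lookup(obj: Any, name: str) -> Any:
    """Model of Jinja2 `obj.name`: attribute first, then item."""
    try:
        return getattr(obj, name)
    except AttributeError:
        pass
    try:
        return obj[name]
    except (TypeError, LookupError):
        return UNDEFINED


def subscript_lookup(obj: Any, name: str) -> Any:
    """Model of Jinja2 `obj['name']`: item first, then attribute."""
    try:
        return obj[name]
    except (TypeError, LookupError):
        pass
    try:
        return getattr(obj, name)
    except AttributeError:
        return UNDEFINED


item = {"values": ["a", "b"], "name": "web"}
print(dot_lookup(item, "values"))        # the bound dict.values method, not the list
print(subscript_lookup(item, "values"))  # ['a', 'b']
```

Keys that don't collide with a dict method (such as `name`) resolve the same either way. That is why the bug only appears for keys like `values`, `items`, `keys` or `get`, and why indexing with `item['key']` avoids the whole class.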

2026-06-10

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff, 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in .claude/settings.json blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times. The lesson: a behaviour that conflicts with an external skill's script needs a mechanical guard, not another note. |
| Every git commit needs rbw unlock, recurring (05-30) | CHANGE | The root cause was not the vault syntax-check (.ansible-lint already excludes vault.yml). It was ansible-lint auto-loading and decrypting inventories/production/group_vars/all/vault.yml via the wired vault_password_file. Scoped the pre-commit ansible-lint hook (always_run: false + files: ansible content) so docs-/config-only commits skip it and need no vault. Ansible-content commits still need rbw (intrinsic to linting vault-backed plays; accepted). |
| make test fails when run non-activated: ansible-config not found (06-06) | CHANGE | The Makefile's test and test-all targets now prepend $(CURDIR)/.venv/bin to PATH. |
| Molecule image missing from the Forgejo registry (06-06) | ALREADY-BUILT | A make molecule-image-push target exists. |
| Deferred decision goes stale across docs, 3× (06-05) | ALREADY-BUILT | scripts/repo-scan.py open-deferred-item / stale-deferred checks, run by /review-repo. |
| make new-role brace-expansion fails under dash (05-30) | FIXED | Explicit paths in the Makefile target. |
| nft iif vs iifname, Molecule ansible_host, apply-path coverage blind spot, render-nft -c pattern (06-06) | MIGRATE | docs/testing/gotchas.md (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, rbw sync stale cache, zsh word-split (05-30) | MIGRATE | docs/runbooks/claude-code-setup.md, "Environment gotchas". |
| finishing-a-development-branch offers open-a-PR vs our trunk-based merge (06-01) | ACCEPTED | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
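The mechanical guard in both ledgers is a Stop hook that blocks the turn when the execution menu or the spec-review gate appears. Its logic reduces to two text matchers plus a block decision. Below is a Python sketch of that logic, not the actual .claude/hooks/guard-execution-mode-menu.sh. The spec-gate keywords come from the 2026-06-14 row. The execution-menu markers and both reason strings are hypothetical placeholders. The output shape assumes Claude Code's Stop-hook JSON convention (`{"decision": "block", "reason": ...}`), which is worth checking against the hooks documentation. The sketch omits reading the hook input from stdin and locating the last assistant message in the transcript.

```python
import json
from typing import Optional

# Hypothetical markers for the writing-plans execution-mode menu; the real
# guard matches the skill's actual menu text, which is not reproduced here.
MENU_MARKERS = ("subagent-driven", "parallel session")

# Second matcher, as described in the 2026-06-14 ledger row.
SPEC_GATE_KEYWORDS = ("review", "the spec", "before", "implementation plan")
SPEC_GATE_LITERAL = "spec written and committed"


def is_execution_menu(text: str) -> bool:
    t = text.lower()
    return all(marker in t for marker in MENU_MARKERS)


def is_spec_review_gate(text: str) -> bool:
    t = text.lower()
    if SPEC_GATE_LITERAL in t:
        return True
    # All four keywords must co-occur, which keeps meta-discussion from matching.
    return all(k in t for k in SPEC_GATE_KEYWORDS)


def stop_hook_decision(last_message: str) -> Optional[dict]:
    """Return the JSON object the hook would print to block, or None to allow."""
    if is_execution_menu(last_message):
        # Illustrative reason text.
        return {"decision": "block", "reason": "Don't ask: proceed subagent-driven."}
    if is_spec_review_gate(last_message):
        # Illustrative reason text.
        return {"decision": "block", "reason": "Standing agreement: skip the spec-review gate."}
    return None


decision = stop_hook_decision("Spec written and committed. Review it before the implementation plan?")
if decision is not None:
    print(json.dumps(decision))
```

Requiring every keyword to co-occur is what lets meta-discussion through. A message that talks about "the spec review gate" without also mentioning "before" and "implementation plan" does not match.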

Process note: the 2026-06-10 review was manual (no /retro or /kaizen tool existed yet). The 2026-06-14 block was the first run of /kaizen itself (scripts/friction-scan.py Phase 0 + .claude/commands/kaizen.md); the dogfood both cleared the backlog and validated the command.