boma/docs/FRICTION.md

# FRICTION.md — kaizen friction log

Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is
the input that keeps our tooling and conventions sharpening over time instead of only
accreting.

**How to use:** append freely _during_ work under **Open signals** — don't curate,
don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal
(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs,
and moves consumed signals into the **decisions ledger** below.

**Entry format:** `date — [tag] observation — (optional) → systematization idea`
Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
`[recurring]` keeps coming back, should be systematized · `[unused]` tooling not
earning its keep.

---

## Open signals

_(append new raw signals here; the next kaizen review consumes them)_

- `[friction]` **Image push to the Forgejo registry fails with `no basic auth
  credentials`** (2026-06-15): `make caddy-image-push` (and `molecule-image-push`) fail
  unless the Docker daemon on ubongo has an interactive `docker login
  forgejo.nyumbani.baobab.band` session — and those creds are **not in vault** (only
  `gandi` + `hetzner` are), so an agent can't complete a push non-interactively. The
  build half is fully automatable; the push half silently requires a human. → candidate:
  document the `docker login` step in `docs/runbooks/claude-code-setup.md`, **or** store
  a scoped Forgejo registry token in vault + a `make registry-login` target (login via
  `--password-stdin`, `no_log`) so pushes are agent-completable like every other
  vault-backed action.

- `[gotcha]` **Single-file Docker bind mount + atomic config rewrite = stale config in
  the running container** (2026-06-16): `reverse_proxy` bind-mounted the Caddyfile as a
  single file; `ansible.builtin.template` writes atomically (temp + rename → new inode),
  so the running container kept the OLD inode and `caddy reload` (in-container, no restart)
  re-read stale config and silently no-op'd (`"config is unchanged"`). The NetBird route
  never loaded → Caddy never requested its cert; surfaced only by a TLS handshake failure.
  Fix: mount the config **directory** (`./caddy` → `/etc/caddy`) — directory mounts reflect
  inode swaps, so live reload works (proven on askari). NOTE the sibling case: NetBird also
  single-file-mounts `config.yaml`, but its handler does `docker compose restart` (not an
  in-container reload), and a restart DOES re-resolve the bind mount (verified: 0 before,
  1 after) — so restart-based roles are safe; only in-place-reload roles need the dir mount.
  → candidate gotcha doc (`docs/testing/gotchas.md`): "reload-in-place needs a directory
  mount; restart-based roles are fine with a single-file mount."

- `[friction]` **`make check` always fails on the first-ever deploy of a compose service
  role** (2026-06-16): in check mode the "ensure base_dir" task is reported-but-not-run, so
  the later `community.docker.docker_compose_v2` up fails with `"…is not a directory"`
  (missing `project_src`). Not a defect — a real deploy creates the dir — but it means the
  CLAUDE.md "always `make check` before `make deploy`" step is guaranteed-red for any brand
  new stateful role, which erodes trust in the check. → candidate: guard the compose-up with
  `not ansible_check_mode` (clean "skipped" in dry-run; compose can't be meaningfully
  dry-run before first deploy anyway), OR document the one-time expected failure. Decide one.

- `[recurring]` **Re-asked the operator about settled defaults — push + execution mode**
  (2026-06-17): at the M5 plan handoff I asked (a) whether to push to origin and (b) which
  execution mode (subagent-driven vs inline) — both already settled: CLAUDE.md says push to
  `origin` often (off-machine backup), and TODO 10.5 / the standing agreement is "always
  subagent-driven" (there's even `guard-execution-mode-menu.sh`). Same shape as the 5×
  "execution-mode menu asked AGAIN" ledger entries — but this time the ask was my own
  free-form prose ("want those pushed now?", "which execution approach?"), which the
  existing menu-text matcher does NOT catch (it keys on the writing-plans menu's literal
  text). → the gap is that the guard only matches that literal menu; free-form re-asks slip
  through. Candidate: widen the Stop-hook matcher to also flag prose re-asks of
  push-vs-not / subagent-vs-inline, since prose reminders have already failed this many
  times. Default behaviour: **push as backup and proceed subagent-driven without asking.**

- `[friction]` **A docs-only commit still tripped the `rbw`-locked pre-commit guard**
  (2026-06-17): committing only `docs/superpowers/specs/*.md` (no ansible content) was
  blocked needing the vault password, although the 2026-06-10 kaizen fix scoped the
  pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so
  docs-/config-only commits skip it and need no vault. So either the hook's `files:`
  pattern still matches `docs/**` (or `.md`), or a blanket pre-commit step needs the
  vault regardless. → check `.pre-commit-config.yaml`'s `files:`/`exclude:` against the
  spec/plan paths; docs-only commits should not require `rbw`.

- `[friction]` **The agent can't manage `ubongo` (the control node it runs ON) without
  the operator granting access** (2026-06-17): enrolling `ubongo` in the mesh needed two
  manual operator grants because the agent runs as `claude` (no sudo) but the inventory
  manages `ubongo` as `sjat`: (1) `claude`'s SSH key added to `sjat`'s `authorized_keys`
  (`Permission denied (publickey)` otherwise), then (2) `NOPASSWD` sudo for `sjat`
  (`Missing sudo password` otherwise). So the "AI-worker control node" (ADR-015) can drive
  the whole fleet but not itself, unattended. This is the **pending `ansible`-user
  bootstrap** gap (STATUS) biting in practice. → the proper fix is ubongo's bootstrap to a
  key-trusted, NOPASSWD `ansible` (or `sjat`) management identity as part of `base`/its
  control-node recipe, so control-node self-management doesn't need ad-hoc operator grants.

- `[recurring]` **ADRs claim cross-doc reconciliation they didn't actually perform**
  (2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned
  Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said
  Traefik; the rename was left half-done across the doc set and the ADR over-claimed its
  own follow-through. Surfaced only by a full-repo `grep Traefik` during `/review-repo`.
  Same shape as the deferred-decision-goes-stale signal (a decision lands in one place,
  its promised ripple edits don't). → candidate `repo-scan.py` check: when an ADR's text
  asserts "X is updated to Y" / supersedes a named tool, flag remaining occurrences of the
  old name (or verify the claimed edit landed) — the structural cousin of `stale-deferred`.
  (KEEP-OPEN per the 2026-06-14 `/kaizen` run — it's its own build task.)

---

## Kaizen reviews — decisions ledger

Consumed signals and where their resolution now lives. Newest first.

### 2026-06-14

First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
a `repo-scan.py` check is its own build). **Bias-to-remove note:** zero PARK/REMOVE — none
of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process,
which migrate or archive (knowledge is never deleted).

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → **RESOLVED 2026-06-15** | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral **closed** — root cause was **version skew** (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |

### 2026-06-10

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a *mechanical* guard, not another note. |
| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was **not** the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so **docs-/config-only commits skip it and need no vault**. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |

**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't
built). The 2026-06-14 block was the **first run of `/kaizen`** itself
(`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both
cleared the backlog and validated the command.
-												Add kaizen friction log and schedule the kaizen-loop setup

docs/FRICTION.md: a running log of friction/gotchas/recurring-fixes/unused tooling,
seeded with this session's real signals — raw material for the periodic kaizen
review. docs/TODO.md: schedule building /retro in ~1 week, and record the Claude-setup
decision. (Also carries your earlier backlog edits.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-05-30 22:05:40 +02:00
+								# FRICTION.md — kaizen friction log
-												chore(kaizen): first /kaizen run — curate 12 friction signals

Dogfood of the new /kaizen command. 11 consumed, 1 kept open.
- SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule
  tag-isolation testing, API/templating render-only gap); CLAUDE.md
  (item['key'] loop convention, TF module required_providers); public_dns
  README (Gandi null-MX workaround).
- CHANGE → extend the Stop hook to also guard the brainstorming spec-review gate
  (verified: blocks the gate, passes meta-discussion).
- SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder;
  ADR-004 documents the cross-role-naming convention.
- ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024;
  subagent-faithfulness now embodied in the two-stage subagent review.
- KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation.

Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-14 21:46:23 +02:00
+								Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is
 								the input that keeps our tooling and conventions sharpening over time instead of only
 								accreting.
-												Add kaizen friction log and schedule the kaizen-loop setup

docs/FRICTION.md: a running log of friction/gotchas/recurring-fixes/unused tooling,
seeded with this session's real signals — raw material for the periodic kaizen
review. docs/TODO.md: schedule building /retro in ~1 week, and record the Claude-setup
decision. (Also carries your earlier backlog edits.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-05-30 22:05:40 +02:00
-												docs(kaizen): migrate gotchas to docs; curate FRICTION log (2026-06-10 review)

- New docs/testing/gotchas.md (nft iif/iifname, Molecule ansible_host,
  apply-path coverage blind spot, render-nft-c pattern); pointer from ADR-008.
- claude-code-setup.md gains "Environment gotchas" (hooks-need-restart,
  pre-commit stashes unstaged, rbw sync cache, zsh word-split).
- FRICTION.md restructured into Open signals + a decisions ledger; consumed
  signals archived with where their resolution now lives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 12:51:39 +02:00
+								**How to use:** append freely _during_ work under **Open signals** — don't curate,
 								don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
-												chore(kaizen): first /kaizen run — curate 12 friction signals

Dogfood of the new /kaizen command. 11 consumed, 1 kept open.
- SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule
  tag-isolation testing, API/templating render-only gap); CLAUDE.md
  (item['key'] loop convention, TF module required_providers); public_dns
  README (Gandi null-MX workaround).
- CHANGE → extend the Stop hook to also guard the brainstorming spec-review gate
  (verified: blocks the gate, passes meta-discussion).
- SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder;
  ADR-004 documents the cross-role-naming convention.
- ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024;
  subagent-faithfulness now embodied in the two-stage subagent review.
- KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation.

Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-14 21:46:23 +02:00
+								that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal
 								(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
 								toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs,
 								and moves consumed signals into the **decisions ledger** below.
-												Add kaizen friction log and schedule the kaizen-loop setup

docs/FRICTION.md: a running log of friction/gotchas/recurring-fixes/unused tooling,
seeded with this session's real signals — raw material for the periodic kaizen
review. docs/TODO.md: schedule building /retro in ~1 week, and record the Claude-setup
decision. (Also carries your earlier backlog edits.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-05-30 22:05:40 +02:00
 								**Entry format:** `date — [tag] observation — (optional) → systematization idea`
 								Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
 								`[recurring]` keeps coming back, should be systematized · `[unused]` tooling not
 								earning its keep.
 								---
-												docs(kaizen): migrate gotchas to docs; curate FRICTION log (2026-06-10 review)

- New docs/testing/gotchas.md (nft iif/iifname, Molecule ansible_host,
  apply-path coverage blind spot, render-nft-c pattern); pointer from ADR-008.
- claude-code-setup.md gains "Environment gotchas" (hooks-need-restart,
  pre-commit stashes unstaged, rbw sync cache, zsh word-split).
- FRICTION.md restructured into Open signals + a decisions ledger; consumed
  signals archived with where their resolution now lives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 12:51:39 +02:00
+								## Open signals
-												Add kaizen friction log and schedule the kaizen-loop setup

docs/FRICTION.md: a running log of friction/gotchas/recurring-fixes/unused tooling,
seeded with this session's real signals — raw material for the periodic kaizen
review. docs/TODO.md: schedule building /retro in ~1 week, and record the Claude-setup
decision. (Also carries your earlier backlog edits.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-05-30 22:05:40 +02:00
-												docs(kaizen): migrate gotchas to docs; curate FRICTION log (2026-06-10 review)

- New docs/testing/gotchas.md (nft iif/iifname, Molecule ansible_host,
  apply-path coverage blind spot, render-nft-c pattern); pointer from ADR-008.
- claude-code-setup.md gains "Environment gotchas" (hooks-need-restart,
  pre-commit stashes unstaged, rbw sync cache, zsh word-split).
- FRICTION.md restructured into Open signals + a decisions ledger; consumed
  signals archived with where their resolution now lives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 12:51:39 +02:00
+								_(append new raw signals here; the next kaizen review consumes them)_
-												Log Forgejo no-PR-workflow friction in FRICTION.md

Forgejo origin is trunk-based with no merge-request gate, so the
finishing-a-development-branch "open a PR" option doesn't apply — merge
locally then push. Also carries earlier uncommitted FRICTION.md edits
(emphasis normalization + 2026-05-31 ADR-status entry).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-01 11:22:26 +02:00
-												docs(friction): log registry-push auth gotcha (no creds in vault)

Building images is fully automatable; pushing to the Forgejo registry needs an
interactive docker login, and registry creds aren't in vault — so an agent can't
complete a push. Captured for the next kaizen review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-15 06:58:45 +02:00
+								- `[friction]` **Image push to the Forgejo registry fails with `no basic auth
 								  credentials`** (2026-06-15): `make caddy-image-push` (and `molecule-image-push`) fail
 								  unless the Docker daemon on ubongo has an interactive `docker login
 								  forgejo.nyumbani.baobab.band` session — and those creds are **not in vault** (only
 								  `gandi` + `hetzner` are), so an agent can't complete a push non-interactively. The
 								  build half is fully automatable; the push half silently requires a human. → candidate:
 								  document the `docker login` step in `docs/runbooks/claude-code-setup.md`, **or** store
 								  a scoped Forgejo registry token in vault + a `make registry-login` target (login via
 								  `--password-stdin`, `no_log`) so pushes are agent-completable like every other
 								  vault-backed action.
-												docs(netbird): M4b done — STATUS/ROADMAP/risks/friction

netbird_coordinator built + applied to askari (first service role, dashboard live).
STATUS: new "real and working" row + askari/coordinator rows updated. ROADMAP: M4b
done, M5 (peer enrol) next, recorded the v0.72.4 combined-container/embedded-Dex/
no-Coturn reality. accepted-risks R3: Coturn -> STUN wording. FRICTION: single-file
bind-mount stale-inode gotcha + check-before-first-deploy artifact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-16 07:48:53 +02:00
+								- `[gotcha]` **Single-file Docker bind mount + atomic config rewrite = stale config in
 								  the running container** (2026-06-16): `reverse_proxy` bind-mounted the Caddyfile as a
 								  single file; `ansible.builtin.template` writes atomically (temp + rename → new inode),
 								  so the running container kept the OLD inode and `caddy reload` (in-container, no restart)
 								  re-read stale config and silently no-op'd (`"config is unchanged"`). The NetBird route
 								  never loaded → Caddy never requested its cert; surfaced only by a TLS handshake failure.
 								  Fix: mount the config **directory** (`./caddy` → `/etc/caddy`) — directory mounts reflect
 								  inode swaps, so live reload works (proven on askari). NOTE the sibling case: NetBird also
 								  single-file-mounts `config.yaml`, but its handler does `docker compose restart` (not an
 								  in-container reload), and a restart DOES re-resolve the bind mount (verified: 0 before,
 after) — so restart-based roles are safe; only in-place-reload roles need the dir mount.
 								  → candidate gotcha doc (`docs/testing/gotchas.md`): "reload-in-place needs a directory
 								  mount; restart-based roles are fine with a single-file mount."
 								- `[friction]` **`make check` always fails on the first-ever deploy of a compose service
 								  role** (2026-06-16): in check mode the "ensure base_dir" task is reported-but-not-run, so
 								  the later `community.docker.docker_compose_v2` up fails with `"…is not a directory"`
 								  (missing `project_src`). Not a defect — a real deploy creates the dir — but it means the
 								  CLAUDE.md "always `make check` before `make deploy`" step is guaranteed-red for any brand
 								  new stateful role, which erodes trust in the check. → candidate: guard the compose-up with
 								  `not ansible_check_mode` (clean "skipped" in dry-run; compose can't be meaningfully
 								  dry-run before first deploy anyway), OR document the one-time expected failure. Decide one.
-												docs(friction): re-asked operator about push + execution mode (settled)

I re-surfaced two already-settled decisions as questions (push to origin; subagent
vs inline) at the M5 handoff. The existing execution-mode guard only matches the
writing-plans menu's literal text, so free-form prose re-asks slip through. Default:
push as backup and go subagent-driven without asking.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-17 15:58:26 +02:00
+								- `[recurring]` **Re-asked the operator about settled defaults — push + execution mode**
 								  (2026-06-17): at the M5 plan handoff I asked (a) whether to push to origin and (b) which
 								  execution mode (subagent-driven vs inline) — both already settled: CLAUDE.md says push to
 								  `origin` often (off-machine backup), and TODO 10.5 / the standing agreement is "always
 								  subagent-driven" (there's even `guard-execution-mode-menu.sh`). Same shape as the 5×
 								  "execution-mode menu asked AGAIN" ledger entries — but this time the ask was my own
 								  free-form prose ("want those pushed now?", "which execution approach?"), which the
 								  existing menu-text matcher does NOT catch (it keys on the writing-plans menu's literal
 								  text). → the gap is that the guard only matches that literal menu; free-form re-asks slip
 								  through. Candidate: widen the Stop-hook matcher to also flag prose re-asks of
 								  push-vs-not / subagent-vs-inline, since prose reminders have already failed this many
 								  times. Default behaviour: **push as backup and proceed subagent-driven without asking.**
-												docs: M5 mesh enrollment — ubongo + askari on the mesh

STATUS: base mesh concern built + applied; ubongo (100.99.146.14) + askari
(100.99.226.39) enrolled, link verified; ubongo agent-management access (sjat key
+ NOPASSWD sudo) recorded. ROADMAP M5: infra done, laptops = operator step,
mesh-hardening split out as the deferred follow-on. FRICTION: docs-only-commit rbw
guard + control-node self-management access gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-17 16:40:02 +02:00
+								- `[friction]` **A docs-only commit still tripped the `rbw`-locked pre-commit guard**
 								  (2026-06-17): committing only `docs/superpowers/specs/*.md` (no ansible content) was
 								  blocked needing the vault password, although the 2026-06-10 kaizen fix scoped the
 								  pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so
 								  docs-/config-only commits skip it and need no vault. So either the hook's `files:`
 								  pattern still matches `docs/**` (or `.md`), or a blanket pre-commit step needs the
 								  vault regardless. → check `.pre-commit-config.yaml`'s `files:`/`exclude:` against the
 								  spec/plan paths; docs-only commits should not require `rbw`.
 								- `[friction]` **The agent can't manage `ubongo` (the control node it runs ON) without
 								  the operator granting access** (2026-06-17): enrolling `ubongo` in the mesh needed two
 								  manual operator grants because the agent runs as `claude` (no sudo) but the inventory
 								  manages `ubongo` as `sjat`: (1) `claude`'s SSH key added to `sjat`'s `authorized_keys`
 								  (`Permission denied (publickey)` otherwise), then (2) `NOPASSWD` sudo for `sjat`
 								  (`Missing sudo password` otherwise). So the "AI-worker control node" (ADR-015) can drive
 								  the whole fleet but not itself, unattended. This is the **pending `ansible`-user
 								  bootstrap** gap (STATUS) biting in practice. → the proper fix is ubongo's bootstrap to a
 								  key-trusted, NOPASSWD `ansible` (or `sjat`) management identity as part of `base`/its
 								  control-node recipe, so control-node self-management doesn't need ad-hoc operator grants.
-												docs(friction): log 2026-06-14 review+follow-up signals

Three new Open signals: ansible-lint no-role-prefix vs ADR-021/022 access__/
backup__ conventions (first service role); Molecule tag-propagation now testable
via tagged converge + full-then-partial; ADRs over-claiming cross-doc reconciliation
(repo-scan check candidate, cousin of stale-deferred).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-14 20:28:15 +02:00
+								- `[recurring]` **ADRs claim cross-doc reconciliation they didn't actually perform**
 								  (2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned
 								  Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said
 								  Traefik; the rename was left half-done across the doc set and the ADR over-claimed its
 								  own follow-through. Surfaced only by a full-repo `grep Traefik` during `/review-repo`.
 								  Same shape as the deferred-decision-goes-stale signal (a decision lands in one place,
 								  its promised ripple edits don't). → candidate `repo-scan.py` check: when an ADR's text
 								  asserts "X is updated to Y" / supersedes a named tool, flag remaining occurrences of the
 								  old name (or verify the claimed edit landed) — the structural cousin of `stale-deferred`.
-												chore(kaizen): first /kaizen run — curate 12 friction signals

Dogfood of the new /kaizen command. 11 consumed, 1 kept open.
- SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule
  tag-isolation testing, API/templating render-only gap); CLAUDE.md
  (item['key'] loop convention, TF module required_providers); public_dns
  README (Gandi null-MX workaround).
- CHANGE → extend the Stop hook to also guard the brainstorming spec-review gate
  (verified: blocks the gate, passes meta-discussion).
- SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder;
  ADR-004 documents the cross-role-naming convention.
- ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024;
  subagent-faithfulness now embodied in the two-stage subagent review.
- KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation.

Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-14 21:46:23 +02:00
+								  (KEEP-OPEN per the 2026-06-14 `/kaizen` run — it's its own build task.)
-												docs(friction): log 2026-06-14 review+follow-up signals

Three new Open signals: ansible-lint no-role-prefix vs ADR-021/022 access__/
backup__ conventions (first service role); Molecule tag-propagation now testable
via tagged converge + full-then-partial; ADRs over-claiming cross-doc reconciliation
(repo-scan check candidate, cousin of stale-deferred).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-14 20:28:15 +02:00
-												docs(kaizen): migrate gotchas to docs; curate FRICTION log (2026-06-10 review)

- New docs/testing/gotchas.md (nft iif/iifname, Molecule ansible_host,
  apply-path coverage blind spot, render-nft-c pattern); pointer from ADR-008.
- claude-code-setup.md gains "Environment gotchas" (hooks-need-restart,
  pre-commit stashes unstaged, rbw sync cache, zsh word-split).
- FRICTION.md restructured into Open signals + a decisions ledger; consumed
  signals archived with where their resolution now lives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 12:51:39 +02:00
+								---
-												docs(friction): record host-nftables build gotchas (iif/iifname, molecule ansible_host, venv PATH, apply-path coverage)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-06 19:16:21 +02:00
-												docs(kaizen): migrate gotchas to docs; curate FRICTION log (2026-06-10 review)

- New docs/testing/gotchas.md (nft iif/iifname, Molecule ansible_host,
  apply-path coverage blind spot, render-nft-c pattern); pointer from ADR-008.
- claude-code-setup.md gains "Environment gotchas" (hooks-need-restart,
  pre-commit stashes unstaged, rbw sync cache, zsh word-split).
- FRICTION.md restructured into Open signals + a decisions ledger; consumed
  signals archived with where their resolution now lives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 12:51:39 +02:00
+								## Kaizen reviews — decisions ledger
-												docs(friction): log execution-mode recurrence; fix list de-indents

Complete the 2026-06-09 entry (third recurrence of presenting the
execution-mode menu despite the standing subagent-driven preference) and
restore two continuation-line indents a markdown formatter had stripped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 08:54:37 +02:00
-												docs(kaizen): migrate gotchas to docs; curate FRICTION log (2026-06-10 review)

- New docs/testing/gotchas.md (nft iif/iifname, Molecule ansible_host,
  apply-path coverage blind spot, render-nft-c pattern); pointer from ADR-008.
- claude-code-setup.md gains "Environment gotchas" (hooks-need-restart,
  pre-commit stashes unstaged, rbw sync cache, zsh word-split).
- FRICTION.md restructured into Open signals + a decisions ledger; consumed
  signals archived with where their resolution now lives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 12:51:39 +02:00
+								Consumed signals and where their resolution now lives. Newest first.
-												docs(friction): log execution-mode recurrence; fix list de-indents

Complete the 2026-06-09 entry (third recurrence of presenting the
execution-mode menu despite the standing subagent-driven preference) and
restore two continuation-line indents a markdown formatter had stripped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 08:54:37 +02:00
-												chore(kaizen): first /kaizen run — curate 12 friction signals

Dogfood of the new /kaizen command. 11 consumed, 1 kept open.
- SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule
  tag-isolation testing, API/templating render-only gap); CLAUDE.md
  (item['key'] loop convention, TF module required_providers); public_dns
  README (Gandi null-MX workaround).
- CHANGE → extend the Stop hook to also guard the brainstorming spec-review gate
  (verified: blocks the gate, passes meta-discussion).
- SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder;
  ADR-004 documents the cross-role-naming convention.
- ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024;
  subagent-faithfulness now embodied in the two-stage subagent review.
- KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation.

Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-14 21:46:23 +02:00
+								### 2026-06-14
 								First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
 								a `repo-scan.py` check is its own build). **Bias-to-remove note:** zero PARK/REMOVE — none
 								of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process,
 								which migrate or archive (knowledge is never deleted).
 								| Signal (first seen) | Verdict | Resolution / where it lives now |
 								|---|---|---|
 								| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
 								| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
 								| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
 								| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
-												docs: record Caddy/Gandi DNS-01 as resolved + proven (was M4a deferral)

ADR-024 Status/Consequences, STATUS.md, ROADMAP M4a, and the FRICTION ledger now
record that the DNS-01 path is built and proven, with the root cause of the M4a
failure (version skew: pre-Bearer libdns/gandi sent the deprecated Apikey header;
plus building on a Hetzner IP). Traefik was reconsidered and rejected again — lego's
Gandi provider has the same PAT-vs-Apikey question, so it would not have helped.

Dated review reports and spec/plan snapshots are left as historical records.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-15 06:57:55 +02:00
+								| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → **RESOLVED 2026-06-15** | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral **closed** — root cause was **version skew** (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
-												chore(kaizen): first /kaizen run — curate 12 friction signals

Dogfood of the new /kaizen command. 11 consumed, 1 kept open.
- SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule
  tag-isolation testing, API/templating render-only gap); CLAUDE.md
  (item['key'] loop convention, TF module required_providers); public_dns
  README (Gandi null-MX workaround).
- CHANGE → extend the Stop hook to also guard the brainstorming spec-review gate
  (verified: blocks the gate, passes meta-discussion).
- SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder;
  ADR-004 documents the cross-role-naming convention.
- ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024;
  subagent-faithfulness now embodied in the two-stage subagent review.
- KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation.

Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-14 21:46:23 +02:00
+								| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
 								| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
 								| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
 								| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
 								| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
 								| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
 								| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |
-												docs(kaizen): migrate gotchas to docs; curate FRICTION log (2026-06-10 review)

- New docs/testing/gotchas.md (nft iif/iifname, Molecule ansible_host,
  apply-path coverage blind spot, render-nft-c pattern); pointer from ADR-008.
- claude-code-setup.md gains "Environment gotchas" (hooks-need-restart,
  pre-commit stashes unstaged, rbw sync cache, zsh word-split).
- FRICTION.md restructured into Open signals + a decisions ledger; consumed
  signals archived with where their resolution now lives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 12:51:39 +02:00
+								### 2026-06-10
-												docs(friction): log execution-mode ask (4th occurrence)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 11:06:25 +02:00
-												docs(kaizen): migrate gotchas to docs; curate FRICTION log (2026-06-10 review)

- New docs/testing/gotchas.md (nft iif/iifname, Molecule ansible_host,
  apply-path coverage blind spot, render-nft-c pattern); pointer from ADR-008.
- claude-code-setup.md gains "Environment gotchas" (hooks-need-restart,
  pre-commit stashes unstaged, rbw sync cache, zsh word-split).
- FRICTION.md restructured into Open signals + a decisions ledger; consumed
  signals archived with where their resolution now lives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 12:51:39 +02:00
+								| Signal (first seen) | Verdict | Resolution / where it lives now |
 								|---|---|---|
 								| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a *mechanical* guard, not another note. |
 								| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was **not** the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so **docs-/config-only commits skip it and need no vault**. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
 								| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
 								| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
 								| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
 								| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
 								| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
 								| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
 								| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
-												docs(friction): log execution-mode ask (4th occurrence)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-10 11:06:25 +02:00
-												chore(kaizen): first /kaizen run — curate 12 friction signals

Dogfood of the new /kaizen command. 11 consumed, 1 kept open.
- SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule
  tag-isolation testing, API/templating render-only gap); CLAUDE.md
  (item['key'] loop convention, TF module required_providers); public_dns
  README (Gandi null-MX workaround).
- CHANGE → extend the Stop hook to also guard the brainstorming spec-review gate
  (verified: blocks the gate, passes meta-discussion).
- SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder;
  ADR-004 documents the cross-role-naming convention.
- ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024;
  subagent-faithfulness now embodied in the two-stage subagent review.
- KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation.

Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

											
										
										
											2026-06-14 21:46:23 +02:00
+								**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't
 								built). The 2026-06-14 block was the **first run of `/kaizen`** itself
 								(`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both
 								cleared the backlog and validated the command.