boma/docs/FRICTION.md
sjat 4c8fb9e03b docs: M5 mesh enrollment — ubongo + askari on the mesh
STATUS: base mesh concern built + applied; ubongo (100.99.146.14) + askari
(100.99.226.39) enrolled, link verified; ubongo agent-management access (sjat key
+ NOPASSWD sudo) recorded. ROADMAP M5: infra done, laptops = operator step,
mesh-hardening split out as the deferred follow-on. FRICTION: docs-only-commit rbw
guard + control-node self-management access gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 16:40:02 +02:00

148 lines
13 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# FRICTION.md — kaizen friction log
Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is
the input that keeps our tooling and conventions sharpening over time instead of only
accreting.
**How to use:** append freely _during_ work under **Open signals** — don't curate,
don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal
(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs,
and moves consumed signals into the **decisions ledger** below.
**Entry format:** `date — [tag] observation — (optional) → systematization idea`
Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
`[recurring]` keeps coming back, should be systematized · `[unused]` tooling not
earning its keep.
---
## Open signals
_(append new raw signals here; the next kaizen review consumes them)_
- `[friction]` **Image push to the Forgejo registry fails with `no basic auth
credentials`** (2026-06-15): `make caddy-image-push` (and `molecule-image-push`) fail
unless the Docker daemon on ubongo has an interactive `docker login
forgejo.nyumbani.baobab.band` session — and those creds are **not in vault** (only
`gandi` + `hetzner` are), so an agent can't complete a push non-interactively. The
build half is fully automatable; the push half silently requires a human. → candidate:
document the `docker login` step in `docs/runbooks/claude-code-setup.md`, **or** store
a scoped Forgejo registry token in vault + a `make registry-login` target (login via
`--password-stdin`, `no_log`) so pushes are agent-completable like every other
vault-backed action.
- `[gotcha]` **Single-file Docker bind mount + atomic config rewrite = stale config in
the running container** (2026-06-16): `reverse_proxy` bind-mounted the Caddyfile as a
single file; `ansible.builtin.template` writes atomically (temp + rename → new inode),
so the running container kept the OLD inode and `caddy reload` (in-container, no restart)
re-read stale config and silently no-op'd (`"config is unchanged"`). The NetBird route
never loaded → Caddy never requested its cert; surfaced only by a TLS handshake failure.
Fix: mount the config **directory** (`./caddy` → `/etc/caddy`) — directory mounts reflect
inode swaps, so live reload works (proven on askari). NOTE the sibling case: NetBird also
single-file-mounts `config.yaml`, but its handler does `docker compose restart` (not an
in-container reload), and a restart DOES re-resolve the bind mount (verified: 0 before,
1 after) — so restart-based roles are safe; only in-place-reload roles need the dir mount.
→ candidate gotcha doc (`docs/testing/gotchas.md`): "reload-in-place needs a directory
mount; restart-based roles are fine with a single-file mount."
- `[friction]` **`make check` always fails on the first-ever deploy of a compose service
role** (2026-06-16): in check mode the "ensure base_dir" task is reported-but-not-run, so
the later `community.docker.docker_compose_v2` up fails with `"…is not a directory"`
(missing `project_src`). Not a defect — a real deploy creates the dir — but it means the
CLAUDE.md "always `make check` before `make deploy`" step is guaranteed-red for any brand
new stateful role, which erodes trust in the check. → candidate: guard the compose-up with
`not ansible_check_mode` (clean "skipped" in dry-run; compose can't be meaningfully
dry-run before first deploy anyway), OR document the one-time expected failure. Decide one.
- `[recurring]` **Re-asked the operator about settled defaults — push + execution mode**
(2026-06-17): at the M5 plan handoff I asked (a) whether to push to origin and (b) which
execution mode (subagent-driven vs inline) — both already settled: CLAUDE.md says push to
`origin` often (off-machine backup), and TODO 10.5 / the standing agreement is "always
subagent-driven" (there's even `guard-execution-mode-menu.sh`). Same shape as the 5×
"execution-mode menu asked AGAIN" ledger entries — but this time the ask was my own
free-form prose ("want those pushed now?", "which execution approach?"), which the
existing menu-text matcher does NOT catch (it keys on the writing-plans menu's literal
text). → the gap is that the guard only matches that literal menu; free-form re-asks slip
through. Candidate: widen the Stop-hook matcher to also flag prose re-asks of
push-vs-not / subagent-vs-inline, since prose reminders have already failed this many
times. Default behaviour: **push as backup and proceed subagent-driven without asking.**
- `[friction]` **A docs-only commit still tripped the `rbw`-locked pre-commit guard**
(2026-06-17): committing only `docs/superpowers/specs/*.md` (no ansible content) was
blocked needing the vault password, although the 2026-06-10 kaizen fix scoped the
pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so
docs-/config-only commits skip it and need no vault. So either the hook's `files:`
pattern still matches `docs/**` (or `.md`), or a blanket pre-commit step needs the
vault regardless. → check `.pre-commit-config.yaml`'s `files:`/`exclude:` against the
spec/plan paths; docs-only commits should not require `rbw`.
- `[friction]` **The agent can't manage `ubongo` (the control node it runs ON) without
the operator granting access** (2026-06-17): enrolling `ubongo` in the mesh needed two
manual operator grants because the agent runs as `claude` (no sudo) but the inventory
manages `ubongo` as `sjat`: (1) `claude`'s SSH key added to `sjat`'s `authorized_keys`
(`Permission denied (publickey)` otherwise), then (2) `NOPASSWD` sudo for `sjat`
(`Missing sudo password` otherwise). So the "AI-worker control node" (ADR-015) can drive
the whole fleet but not itself, unattended. This is the **pending `ansible`-user
bootstrap** gap (STATUS) biting in practice. → the proper fix is ubongo's bootstrap to a
key-trusted, NOPASSWD `ansible` (or `sjat`) management identity as part of `base`/its
control-node recipe, so control-node self-management doesn't need ad-hoc operator grants.
- `[recurring]` **ADRs claim cross-doc reconciliation they didn't actually perform**
(2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned
Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said
Traefik; the rename was left half-done across the doc set and the ADR over-claimed its
own follow-through. Surfaced only by a full-repo `grep Traefik` during `/review-repo`.
Same shape as the deferred-decision-goes-stale signal (a decision lands in one place,
its promised ripple edits don't). → candidate `repo-scan.py` check: when an ADR's text
asserts "X is updated to Y" / supersedes a named tool, flag remaining occurrences of the
old name (or verify the claimed edit landed) — the structural cousin of `stale-deferred`.
(KEEP-OPEN per the 2026-06-14 `/kaizen` run — it's its own build task.)
---
## Kaizen reviews — decisions ledger
Consumed signals and where their resolution now lives. Newest first.
### 2026-06-14
First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
a `repo-scan.py` check is its own build). **Bias-to-remove note:** zero PARK/REMOVE — none
of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process,
which migrate or archive (knowledge is never deleted).
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → **RESOLVED 2026-06-15** | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral **closed** — root cause was **version skew** (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |
### 2026-06-10
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a *mechanical* guard, not another note. |
| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was **not** the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so **docs-/config-only commits skip it and need no vault**. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't
built). The 2026-06-14 block was the **first run of `/kaizen`** itself
(`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both
cleared the backlog and validated the command.