sjat/boma

sjat 4c8fb9e03b docs: M5 mesh enrollment — ubongo + askari on the mesh

STATUS: base mesh concern built + applied; ubongo (100.99.146.14) + askari
(100.99.226.39) enrolled, link verified; ubongo agent-management access (sjat key
+ NOPASSWD sudo) recorded. ROADMAP M5: infra done, laptops = operator step,
mesh-hardening split out as the deferred follow-on. FRICTION: docs-only-commit rbw
guard + control-node self-management access gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-17 16:40:02 +02:00

FRICTION.md — kaizen friction log

Raw signals for the periodic kaizen review (/kaizen; see docs/TODO.md 11). This is the input that keeps our tooling and conventions sharpening over time instead of only accreting.

How to use: append freely during work under Open signals — don't curate, don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling that isn't earning its keep. /kaizen reads this, then proposes a verdict per signal (SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased toward remove/park for unused tooling), migrates durable knowledge into the right docs, and moves consumed signals into the decisions ledger below.

Entry format: date — [tag] observation — (optional) → systematization idea

Tags: [friction] recurring annoyance · [gotcha] surprising behaviour · [recurring] keeps coming back, should be systematized · [unused] tooling not earning its keep.

Open signals

(append new raw signals here; the next kaizen review consumes them)

[friction] Image push to the Forgejo registry fails with no basic auth credentials (2026-06-15): make caddy-image-push (and molecule-image-push) fail unless the Docker daemon on ubongo has an interactive docker login forgejo.nyumbani.baobab.band session — and those creds are not in vault (only gandi + hetzner are), so an agent can't complete a push non-interactively. The build half is fully automatable; the push half silently requires a human. → candidate: document the docker login step in docs/runbooks/claude-code-setup.md, or store a scoped Forgejo registry token in vault + a make registry-login target (login via --password-stdin, no_log) so pushes are agent-completable like every other vault-backed action.
[gotcha] Single-file Docker bind mount + atomic config rewrite = stale config in the running container (2026-06-16): reverse_proxy bind-mounted the Caddyfile as a single file; ansible.builtin.template writes atomically (temp + rename → new inode), so the running container kept the OLD inode and caddy reload (in-container, no restart) re-read stale config and silently no-op'd ("config is unchanged"). The NetBird route never loaded → Caddy never requested its cert; surfaced only by a TLS handshake failure. Fix: mount the config directory (./caddy → /etc/caddy) — directory mounts reflect inode swaps, so live reload works (proven on askari). NOTE the sibling case: NetBird also single-file-mounts config.yaml, but its handler does docker compose restart (not an in-container reload), and a restart DOES re-resolve the bind mount (verified: 0 before, 1 after) — so restart-based roles are safe; only in-place-reload roles need the dir mount. → candidate gotcha doc (docs/testing/gotchas.md): "reload-in-place needs a directory mount; restart-based roles are fine with a single-file mount."
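The inode mechanism behind this gotcha can be reproduced without Docker. The sketch below is only an analogy, under stated assumptions: an already-open file handle stands in for the single-file bind mount (both stay pinned to the original inode), and re-opening the path by name stands in for the directory mount. The helper names `atomic_write` and `demo` are hypothetical and are not part of the role.

```python
import os
import tempfile


def atomic_write(path: str, text: str) -> None:
    """Write via temp file + rename, the pattern an atomic template write follows."""
    directory = os.path.dirname(path)
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(text)
    # After the rename, the same path name points at a different inode.
    os.replace(tmp_path, path)


def demo() -> dict:
    with tempfile.TemporaryDirectory() as workdir:
        config = os.path.join(workdir, "Caddyfile")
        with open(config, "w") as f:
            f.write("old")
        inode_before = os.stat(config).st_ino

        # Stand-in for the single-file bind mount: a handle pinned to the old inode.
        pinned = open(config)

        atomic_write(config, "new")
        inode_after = os.stat(config).st_ino

        pinned.seek(0)
        seen_via_pinned = pinned.read()
        pinned.close()

        # Stand-in for the directory mount: resolve the name again inside the directory.
        with open(config) as f:
            seen_via_directory = f.read()

    return {
        "inode_changed": inode_before != inode_after,
        "seen_via_pinned": seen_via_pinned,
        "seen_via_directory": seen_via_directory,
    }


if __name__ == "__main__":
    print(demo())
```

The pinned handle keeps returning the old content even though the path already holds the new one, which is the same shape as the in-container reload reporting an unchanged config.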
[friction] make check always fails on the first-ever deploy of a compose service role (2026-06-16): in check mode the "ensure base_dir" task is reported-but-not-run, so the later community.docker.docker_compose_v2 up fails with "…is not a directory" (missing project_src). Not a defect — a real deploy creates the dir — but it means the CLAUDE.md "always make check before make deploy" step is guaranteed-red for any brand new stateful role, which erodes trust in the check. → candidate: guard the compose-up with not ansible_check_mode (clean "skipped" in dry-run; compose can't be meaningfully dry-run before first deploy anyway), OR document the one-time expected failure. Decide one.
[recurring] Re-asked the operator about settled defaults — push + execution mode (2026-06-17): at the M5 plan handoff I asked (a) whether to push to origin and (b) which execution mode (subagent-driven vs inline) — both already settled: CLAUDE.md says push to origin often (off-machine backup), and TODO 10.5 / the standing agreement is "always subagent-driven" (there's even guard-execution-mode-menu.sh). Same shape as the 5× "execution-mode menu asked AGAIN" ledger entries — but this time the ask was my own free-form prose ("want those pushed now?", "which execution approach?"), which the existing menu-text matcher does NOT catch (it keys on the writing-plans menu's literal text). → the gap is that the guard only matches that literal menu; free-form re-asks slip through. Candidate: widen the Stop-hook matcher to also flag prose re-asks of push-vs-not / subagent-vs-inline, since prose reminders have already failed this many times. Default behaviour: push as backup and proceed subagent-driven without asking.
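A rough sketch of what a widened matcher might look like, using the two prose re-asks quoted above as the only test material. Everything here is an assumption: the existing guard is a shell hook whose internals are not shown in this log, and the patterns and the function name `flags_reask` are hypothetical illustrations, not the hook's actual logic.

```python
import re

# Hypothetical patterns for free-form re-asks about two settled defaults.
# Each one requires a question mark so that plain status statements pass.
PUSH_REASK = re.compile(
    r"\b(want|should|shall)\b[^.?!]*\bpush(ed)?\b[^.?!]*\?", re.IGNORECASE
)
MODE_REASK = re.compile(
    r"\bwhich\b[^.?!]*\bexecution\b[^.?!]*\b(approach|mode)\b[^.?!]*\?",
    re.IGNORECASE,
)


def flags_reask(message: str) -> bool:
    """Return True when the message re-asks push-vs-not or the execution mode."""
    return bool(PUSH_REASK.search(message) or MODE_REASK.search(message))


if __name__ == "__main__":
    for text in (
        "want those pushed now?",
        "which execution approach?",
        "Pushed to origin and proceeding subagent-driven.",
    ):
        print(flags_reask(text), "-", text)
```

Patterns this loose would need tuning against real transcripts, since a prose matcher risks blocking legitimate questions that merely mention pushing.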
[friction] A docs-only commit still tripped the rbw-locked pre-commit guard (2026-06-17): committing only docs/superpowers/specs/*.md (no ansible content) was blocked needing the vault password, although the 2026-06-10 kaizen fix scoped the pre-commit ansible-lint hook (always_run: false + files: ansible content) so docs-/config-only commits skip it and need no vault. So either the hook's files: pattern still matches docs/** (or .md), or a blanket pre-commit step needs the vault regardless. → check .pre-commit-config.yaml's files:/exclude: against the spec/plan paths; docs-only commits should not require rbw.
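One way to run the suggested check offline is to replay the hook's `files:` / `exclude:` patterns against the staged paths. As far as I understand pre-commit, both are Python regular expressions applied with `re.search` to repo-relative paths, and the default `exclude` is `^$`. The two patterns and the spec filename below are hypothetical examples, not the values in this repo's `.pre-commit-config.yaml`.

```python
import re


def paths_hook_receives(files: str, exclude: str, staged: list[str]) -> list[str]:
    """Return the staged paths a hook with these patterns would be handed."""
    include_re = re.compile(files)
    exclude_re = re.compile(exclude)
    return [
        path
        for path in staged
        if include_re.search(path) and not exclude_re.search(path)
    ]


if __name__ == "__main__":
    # Hypothetical docs-only commit.
    staged = ["docs/superpowers/specs/example-spec.md"]

    # Hypothetical over-broad pattern: any YAML or Markdown file.
    broad = r"\.(ya?ml|md)$"
    # Hypothetical scoped pattern: only YAML under the ansible content directories.
    scoped = r"^(roles|playbooks|inventories)/.*\.ya?ml$"

    print("broad :", paths_hook_receives(broad, r"^$", staged))
    print("scoped:", paths_hook_receives(scoped, r"^$", staged))
```

An empty result means the hook is skipped for that commit, provided `always_run` is false. A non-empty result for a docs-only commit would point at the pattern as the cause.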
[friction] The agent can't manage ubongo (the control node it runs ON) without the operator granting access (2026-06-17): enrolling ubongo in the mesh needed two manual operator grants because the agent runs as claude (no sudo) but the inventory manages ubongo as sjat: (1) claude's SSH key added to sjat's authorized_keys (Permission denied (publickey) otherwise), then (2) NOPASSWD sudo for sjat (Missing sudo password otherwise). So the "AI-worker control node" (ADR-015) can drive the whole fleet but not itself, unattended. This is the pending ansible-user bootstrap gap (STATUS) biting in practice. → the proper fix is ubongo's bootstrap to a key-trusted, NOPASSWD ansible (or sjat) management identity as part of base/its control-node recipe, so control-node self-management doesn't need ad-hoc operator grants.
[recurring] ADRs claim cross-doc reconciliation they didn't actually perform (2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said Traefik; the rename was left half-done across the doc set and the ADR over-claimed its own follow-through. Surfaced only by a full-repo grep Traefik during /review-repo. Same shape as the deferred-decision-goes-stale signal (a decision lands in one place, its promised ripple edits don't). → candidate repo-scan.py check: when an ADR's text asserts "X is updated to Y" / supersedes a named tool, flag remaining occurrences of the old name (or verify the claimed edit landed) — the structural cousin of stale-deferred. (KEEP-OPEN per the 2026-06-14 /kaizen run — it's its own build task.)
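The candidate check in the last signal could start out as small as the sketch below. It is a sketch under assumptions, not the `repo-scan.py` implementation: the claim pattern only recognises the one phrasing quoted above, the function name `find_overclaims` is hypothetical, and the sample document texts in the usage section are invented for illustration.

```python
import re

# Hypothetical claim shape, taken from the quoted ADR-024 sentence:
# "... mentioned Traefik is updated to read Caddy"
CLAIM_RE = re.compile(r"mentioned (\w+) is updated to read (\w+)")


def find_overclaims(docs: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (claiming_doc, old_name, doc_still_using_old_name) triples."""
    findings = []
    for claimant, text in docs.items():
        for old_name, _new_name in CLAIM_RE.findall(text):
            old_re = re.compile(rf"\b{re.escape(old_name)}\b")
            for other, other_text in docs.items():
                # The claiming doc itself necessarily still names the old tool.
                if other != claimant and old_re.search(other_text):
                    findings.append((claimant, old_name, other))
    return findings


if __name__ == "__main__":
    # Invented sample texts; only the ADR-024 sentence comes from the log.
    docs = {
        "ADR-024": "ADR-017 prose that mentioned Traefik is updated to read Caddy",
        "ADR-017": "Ingress is handled by Traefik.",
        "ADR-019": "Caddy terminates TLS.",
    }
    for claimant, old_name, offender in find_overclaims(docs):
        print(f"{claimant} claims {old_name} was replaced, but {offender} still uses it")
```

A real check would also have to tolerate deliberate historical mentions of the old name, for example in superseded ADRs.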

Kaizen reviews — decisions ledger

Consumed signals and where their resolution now lives. Newest first.

2026-06-14

First /kaizen run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above — a repo-scan.py check is its own build). Bias-to-remove note: zero PARK/REMOVE — none of the open signals were [unused] tooling; they were all knowledge/gotchas/process, which migrate or archive (knowledge is never deleted).

| Signal (first seen) | Verdict | Resolution / where it lives now |
| --- | --- | --- |
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is verified firing on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → RESOLVED 2026-06-15 | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral closed — root cause was version skew (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |
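The `item.values` row in the ledger above comes down to lookup order. The sketch below is a simplified pure-Python emulation of how Jinja2 resolves dot access versus subscript access, written from my understanding of its documented behaviour. It does not import Jinja2, the helper names are hypothetical, and unlike Jinja2 it raises on a missing name instead of returning an undefined value.

```python
def dot_lookup(obj, name):
    """Emulate `obj.name`: try the attribute first, then fall back to the item."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


def subscript_lookup(obj, key):
    """Emulate `obj['key']`: try the item first, then fall back to the attribute."""
    try:
        return obj[key]
    except (KeyError, TypeError):
        return getattr(obj, key)


if __name__ == "__main__":
    # Hypothetical loop item whose key collides with a dict method name.
    item = {"name": "edge", "values": [1, 2]}

    print(dot_lookup(item, "name"))          # no collision, so the key is found
    print(dot_lookup(item, "values"))        # the bound dict.values method
    print(subscript_lookup(item, "values"))  # the list stored under the key
```

Any key that shares a name with a dict method (`values`, `items`, `keys`, `get`, and so on) hits the same collision, which is why the convention indexes loop-variable keys with a subscript.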

2026-06-10

| Signal (first seen) | Verdict | Resolution / where it lives now |
| --- | --- | --- |
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a mechanical guard, not another note. |
| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was not the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so docs-/config-only commits skip it and need no vault. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |

Process note: the 2026-06-10 review was manual (the /retro / /kaizen tool wasn't built yet). The 2026-06-14 block was the first run of /kaizen itself (scripts/friction-scan.py Phase 0 + .claude/commands/kaizen.md); the dogfood both cleared the backlog and validated the command.
