docs(kaizen): bind-mount gotcha + consume 7 signals into the ledger (2026-06-17)

Migrate the single-file-bind-mount/stale-config gotcha (reload-in-place needs a
directory mount; restart-based roles don't) to docs/testing/gotchas.md, and move
all 7 open signals out of FRICTION.md's Open-signals section into the new
2026-06-17 decisions-ledger block: all consumed, 1 PARK (the ubongo
self-management gap, tracked in STATUS), 0 REMOVE. Relax test_load_signals to
accept an empty Open-signals section (the goal state after a kaizen pass).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-17 17:50:17 +02:00
parent c1323a3f29
commit a0762c563e
3 changed files with 45 additions and 79 deletions


@@ -22,90 +22,35 @@ earning its keep.
_(append new raw signals here; the next kaizen review consumes them)_
- `[friction]` **Image push to the Forgejo registry fails with `no basic auth
credentials`** (2026-06-15): `make caddy-image-push` (and `molecule-image-push`) fail
unless the Docker daemon on ubongo has an interactive `docker login
forgejo.nyumbani.baobab.band` session — and those creds are **not in vault** (only
`gandi` + `hetzner` are), so an agent can't complete a push non-interactively. The
build half is fully automatable; the push half silently requires a human. → candidate:
document the `docker login` step in `docs/runbooks/claude-code-setup.md`, **or** store
a scoped Forgejo registry token in vault + a `make registry-login` target (login via
`--password-stdin`, `no_log`) so pushes are agent-completable like every other
vault-backed action.
- `[gotcha]` **Single-file Docker bind mount + atomic config rewrite = stale config in
the running container** (2026-06-16): `reverse_proxy` bind-mounted the Caddyfile as a
single file; `ansible.builtin.template` writes atomically (temp + rename → new inode),
so the running container kept the OLD inode and `caddy reload` (in-container, no restart)
re-read stale config and silently no-op'd (`"config is unchanged"`). The NetBird route
never loaded → Caddy never requested its cert; surfaced only by a TLS handshake failure.
Fix: mount the config **directory** (`./caddy` → `/etc/caddy`); directory mounts reflect
inode swaps, so live reload works (proven on askari). NOTE the sibling case: NetBird also
single-file-mounts `config.yaml`, but its handler does `docker compose restart` (not an
in-container reload), and a restart DOES re-resolve the bind mount (verified: 0 before,
1 after) — so restart-based roles are safe; only in-place-reload roles need the dir mount.
→ candidate gotcha doc (`docs/testing/gotchas.md`): "reload-in-place needs a directory
mount; restart-based roles are fine with a single-file mount."
- `[friction]` **`make check` always fails on the first-ever deploy of a compose service
role** (2026-06-16): in check mode the "ensure base_dir" task is reported-but-not-run, so
the later `community.docker.docker_compose_v2` up fails with `"…is not a directory"`
(missing `project_src`). Not a defect — a real deploy creates the dir — but it means the
CLAUDE.md "always `make check` before `make deploy`" step is guaranteed-red for any brand
new stateful role, which erodes trust in the check. → candidate: guard the compose-up with
`not ansible_check_mode` (clean "skipped" in dry-run; compose can't be meaningfully
dry-run before first deploy anyway), OR document the one-time expected failure. Decide one.
- `[recurring]` **Re-asked the operator about settled defaults — push + execution mode**
(2026-06-17): at the M5 plan handoff I asked (a) whether to push to origin and (b) which
execution mode (subagent-driven vs inline) — both already settled: CLAUDE.md says push to
`origin` often (off-machine backup), and TODO 10.5 / the standing agreement is "always
subagent-driven" (there's even `guard-execution-mode-menu.sh`). Same shape as the 5×
"execution-mode menu asked AGAIN" ledger entries — but this time the ask was my own
free-form prose ("want those pushed now?", "which execution approach?"), which the
existing menu-text matcher does NOT catch (it keys on the writing-plans menu's literal
text). → the gap is that the guard only matches that literal menu; free-form re-asks slip
through. Candidate: widen the Stop-hook matcher to also flag prose re-asks of
push-vs-not / subagent-vs-inline, since prose reminders have already failed this many
times. Default behaviour: **push as backup and proceed subagent-driven without asking.**
- `[friction]` **A docs-only commit still tripped the `rbw`-locked pre-commit guard**
(2026-06-17): committing only `docs/superpowers/specs/*.md` (no ansible content) was
blocked needing the vault password, although the 2026-06-10 kaizen fix scoped the
pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so
docs-/config-only commits skip it and need no vault. So either the hook's `files:`
pattern still matches `docs/**` (or `.md`), or a blanket pre-commit step needs the
vault regardless. → check `.pre-commit-config.yaml`'s `files:`/`exclude:` against the
spec/plan paths; docs-only commits should not require `rbw`.
- `[friction]` **The agent can't manage `ubongo` (the control node it runs ON) without
the operator granting access** (2026-06-17): enrolling `ubongo` in the mesh needed two
manual operator grants because the agent runs as `claude` (no sudo) but the inventory
manages `ubongo` as `sjat`: (1) `claude`'s SSH key added to `sjat`'s `authorized_keys`
(`Permission denied (publickey)` otherwise), then (2) `NOPASSWD` sudo for `sjat`
(`Missing sudo password` otherwise). So the "AI-worker control node" (ADR-015) can drive
the whole fleet but not itself, unattended. This is the **pending `ansible`-user
bootstrap** gap (STATUS) biting in practice. → the proper fix is ubongo's bootstrap to a
key-trusted, NOPASSWD `ansible` (or `sjat`) management identity as part of `base`/its
control-node recipe, so control-node self-management doesn't need ad-hoc operator grants.
- `[recurring]` **ADRs claim cross-doc reconciliation they didn't actually perform**
(2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned
Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said
Traefik; the rename was left half-done across the doc set and the ADR over-claimed its
own follow-through. Surfaced only by a full-repo `grep Traefik` during `/review-repo`.
Same shape as the deferred-decision-goes-stale signal (a decision lands in one place,
its promised ripple edits don't). → candidate `repo-scan.py` check: when an ADR's text
asserts "X is updated to Y" / supersedes a named tool, flag remaining occurrences of the
old name (or verify the claimed edit landed) — the structural cousin of `stale-deferred`.
(KEEP-OPEN per the 2026-06-14 `/kaizen` run — it's its own build task.)
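
The candidate check in the last signal can be sketched in a few lines. This is a minimal illustration only, not the contents of `scripts/repo-scan.py`: the function name `stale_mentions`, the rule "skip lines that also name the new term", and the sample lines are all assumptions made for the example.

```python
def stale_mentions(old, new, lines):
    """Return (line_number, text) for lines that still name `old` but not `new`.

    Hypothetical helper: a line that mentions both terms is treated as
    describing the rename itself and is skipped.
    """
    hits = []
    for number, text in enumerate(lines, start=1):
        if old in text and new not in text:
            hits.append((number, text))
    return hits


# Invented sample lines, standing in for a design doc after a Traefik -> Caddy rename.
doc = [
    "Traefik terminates TLS for the mesh.",
    "Traefik was replaced by Caddy in ADR-024.",
    "Caddy requests certificates on demand.",
]
findings = stale_mentions("Traefik", "Caddy", doc)
print(findings)
```

A real check would also need to find the rename announcement in the ADR text; this sketch takes the two names as given.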
---
## Kaizen reviews — decisions ledger
Consumed signals and where their resolution now lives. Newest first.
### 2026-06-17
Second `/kaizen` run. 7 signals triaged; all 7 consumed (0 kept open). Two heavier items
(the `rename-incomplete` scan check and the Forgejo registry-login path) were built by
parallel subagents and verified against the diff. **Bias-to-remove note:** one PARK
(the ubongo self-management gap — out-of-phase, already tracked in STATUS) and zero
REMOVE; the rest accreted (migrate/change). None of the open signals were `[unused]`
*tooling*, so there was nothing to delete — the only reductive move available was parking
the out-of-phase build. **Cadence:** healthy — 3 days after the first run, every signal
0–2 days old except the one carried over from 2026-06-14; the "recurring ≥3" nudge in
`scripts/friction-scan.py` didn't fire this pass (all recurrence counts were 1), so the
thresholds need no change.
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| ADRs claim cross-doc reconciliation they didn't perform (06-14) | SYSTEMATIZE | New `rename-incomplete` check in `scripts/repo-scan.py` (+7 tests): when a numbered ADR announces a rename `Old` → `New`, flag any design-doc line where `Old` still appears in present tense (skips the announcing ADR, lines also naming `New`, and historical/negation cues; rejects `ADR-NNN` tokens as terms). 0 findings on the current tree: the Traefik→Caddy ripple edits have landed. Structural cousin of `stale-deferred`; run by `/review-repo`. (Was KEEP-OPEN on 2026-06-14; now built.) |
| Image push to the Forgejo registry needs an interactive `docker login` (06-15) | SYSTEMATIZE → vault | Vault-backed login path so pushes are agent-completable: `vault.forgejo.registry_token` stub (CHANGEME, operator-minted) + `scripts/registry-login.sh` (reads the token, `docker login --password-stdin`, never echoes it) + `make registry-login` + a prereq note in `docs/runbooks/claude-code-setup.md`. Works once the operator fills the token via `make edit-vault`. |
| Single-file bind mount + atomic rewrite = stale config (06-16) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Single-file bind mount + atomic rewrite = stale config (reload-in-place only)": `template` writes a new inode, a single-file bind mount pins the old one, so an in-container reload reads stale config. Mount the config *directory* for reload-in-place roles; restart-based roles are fine with a single-file mount. |
| `make check` always fails on the first-ever deploy of a compose service role (06-16) | CHANGE | `check_mode: false` on the `state: directory` scaffold tasks in `roles/reverse_proxy` + `roles/netbird_coordinator`, so the base dirs exist under `--check` and the rest of the dry-run (templates + compose) evaluates instead of failing on a missing `project_src`. Inert under converge → Molecule unchanged. |
| Re-asked settled defaults — push + execution mode, in prose (06-17) | CHANGE (exec) + ACCEPTED (push) | Widened `.claude/hooks/guard-execution-mode-menu.sh` to also catch free-form *prose* re-asks of the subagent-vs-inline choice (`"which execution approach?"`, `"subagent vs inline"`, …), not just the literal menu; tested. The push re-ask stays a soft default via the `dont-reask-settled-defaults` memory — a genuine "should I push?" is sometimes legitimate, so it is deliberately not hard-blocked. |
| Docs-only commit tripped the rbw-locked pre-commit guard (06-17) | CHANGE | Root cause was NOT the ansible-lint `files:` scope (innocent) — it was `.claude/hooks/guard-vault-preflight.sh` blocking *every* locked `git commit`. Rewrote it to inspect the staged set (`git diff --cached`, plus `-a`/`--all`) and block only when Ansible content (`^(roles\|playbooks\|inventories)/.*\.ya?ml$`) is staged; docs-/config-only commits are now exempt. Fail-safe to block when unsure. Tested. |
| Agent can't self-manage `ubongo` (the control node it runs on) without operator grants (06-17) | PARK | The knowledge already lives in `STATUS.md` (control-node row: the interim `claude`-key + `sjat` NOPASSWD grants, and **Pending:** the proper `ansible`-user bootstrap) and the `ubongo-self-sufficiency` memory. Out-of-phase — the fix is the control-node bootstrap recipe, a tracked future build. **Resurrection trigger:** when building ubongo's `base` hardening / `ansible`-user bootstrap, fold in key-trusted NOPASSWD self-management so control-node self-management needs no ad-hoc operator grants. |
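
The staged-set rule from the `guard-vault-preflight.sh` row can be sketched as follows. This is an illustration in Python rather than the hook itself (which is a shell script). The function name `needs_vault` and the use of `None` to mean "could not determine the staged set" are assumptions; the pattern is the one quoted in the table, with the Markdown table escapes removed.

```python
import re

# Pattern quoted in the ledger row (the `\|` there is Markdown table escaping).
ANSIBLE_CONTENT = re.compile(r"^(roles|playbooks|inventories)/.*\.ya?ml$")


def needs_vault(staged_paths):
    """Decide whether a locked-vault commit should be blocked.

    `staged_paths` is assumed to be the list printed by
    `git diff --cached --name-only`, or None if it could not be read.
    """
    if staged_paths is None:
        return True  # fail-safe: block when unsure
    return any(ANSIBLE_CONTENT.match(path) for path in staged_paths)


docs_only = needs_vault(["docs/superpowers/specs/m5.md", "Makefile"])
with_role = needs_vault(["docs/FRICTION.md", "roles/reverse_proxy/tasks/main.yml"])
unknown = needs_vault(None)
print(docs_only, with_role, unknown)
```

The file names in the usage lines are made up for the example; only the regular expression comes from the ledger.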
### 2026-06-14
First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —


@@ -70,3 +70,21 @@ testing surprise is worth remembering past the session that hit it.
plus review. Only a real (or `--check`) call against the API surfaces them.
- → Treat a **check-mode run against the real API as a required gate** for such roles, or
build a render-only assertion that materializes and inspects the rendered module args.
## Single-file bind mount + atomic rewrite = stale config (reload-in-place only)
- **`ansible.builtin.template` writes atomically** (temp file + rename → a *new inode*). A
Docker **single-file** bind mount pins the *old* inode, so a container that reloads
config **in place** (no restart) keeps reading the stale file. Live hit: `reverse_proxy`
bind-mounted the Caddyfile as a single file; `caddy reload` (in-container) re-read the
old inode and silently no-op'd (`"config is unchanged"`). The new NetBird route never
loaded → Caddy never requested its cert → surfaced only as a downstream TLS handshake
failure.
- **Fix for reload-in-place roles: bind-mount the config *directory*, not the file**
(`./caddy` → `/etc/caddy`). Directory mounts reflect the inode swap, so the reload sees
the new file (proven on askari).
- **Restart-based roles are fine with a single-file mount.** Sibling case: `netbird`
single-file-mounts `config.yaml`, but its handler does `docker compose restart` (not an
in-container reload), and a **restart re-resolves the bind mount** (verified: route
count 0 before, 1 after). Rule of thumb: **reload-in-place needs a directory mount;
restart-based roles don't.**
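
The inode swap behind this gotcha can be reproduced without Docker. The sketch below assumes a POSIX filesystem and uses an already-open file handle as a stand-in for the single-file bind mount; that is an analogy, not how Docker implements mounts. The helper `atomic_write` and the file contents are invented for the example.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write to a temp file in the same directory, then rename it over `path`."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.replace(tmp, path)  # after the rename, `path` names a new inode


with tempfile.TemporaryDirectory() as workdir:
    config = os.path.join(workdir, "Caddyfile")
    with open(config, "w") as handle:
        handle.write("old route\n")

    inode_before = os.stat(config).st_ino
    pinned = open(config)  # stand-in for the single-file mount: holds the old inode

    atomic_write(config, "new route\n")
    inode_after = os.stat(config).st_ino

    stale = pinned.read()  # reads the old inode's content
    pinned.close()
    with open(config) as handle:
        fresh = handle.read()  # a fresh path lookup resolves to the new inode

    inode_changed = inode_before != inode_after
    print(inode_changed, repr(stale), repr(fresh))
```

The fresh path lookup at the end plays the role of the directory mount: anything that resolves the name again sees the new file, while anything holding the old inode does not.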


@@ -123,5 +123,8 @@ def test_nudge_line_overdue_on_age():
def test_load_signals_reads_real_friction_file():
    path = os.path.join(os.path.dirname(__file__), "..", "docs", "FRICTION.md")
    sigs = fs.load_signals(path, TODAY)
    # May legitimately be empty right after a /kaizen pass consumes every open signal:
    # an empty Open-signals section is the goal state, not a failure. Assert the function
    # parses the real file into well-formed signals (validity holds vacuously when empty).
    assert isinstance(sigs, list)
    assert all(s["tag"] in {"friction", "gotcha", "recurring", "unused"} for s in sigs)
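
The "holds vacuously" point can be checked in isolation. A minimal sketch, assuming signals are dicts with a `tag` key as in the test above; `well_formed` and the sample inputs are hypothetical and do not touch `FRICTION.md` or the `fs` module.

```python
VALID_TAGS = {"friction", "gotcha", "recurring", "unused"}


def well_formed(sigs):
    """True if `sigs` is a list and every signal carries a known tag."""
    return isinstance(sigs, list) and all(s["tag"] in VALID_TAGS for s in sigs)


empty_ok = well_formed([])                  # all() over nothing is True
one_ok = well_formed([{"tag": "gotcha"}])
bad = well_formed([{"tag": "typo"}])
print(empty_ok, one_ok, bad)
```

So the relaxed test still rejects a malformed tag whenever signals exist, and passes on the empty goal state.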