# FRICTION.md — kaizen friction log

Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is
the input that keeps our tooling and conventions sharpening over time instead of only
accreting.

**How to use:** append freely _during_ work under **Open signals** — don't curate,
don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal
(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs,
and moves consumed signals into the **decisions ledger** below.

**Entry format:** `date — [tag] observation — (optional) → systematization idea`
Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
`[recurring]` keeps coming back, should be systematized · `[unused]` tooling not
earning its keep.

---

## Open signals

_(append new raw signals here; the next kaizen review consumes them)_

<!-- The six below are from the 2026-06-17 mesh-hardening-1/3 incident: applying base's
nftables default-deny + wt0-only sshd to askari (the off-site Docker host that ALSO runs
the NetBird coordinator) took it down on reboot; recovery needed the Hetzner console +
a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*. -->

- `[gotcha]` **`base`'s nftables `forward policy drop` breaks Docker hosts on reboot**
  (2026-06-17): `base/templates/nftables.conf.j2` sets `chain forward { ... policy drop; }`.
  On a Docker host, container traffic is *forwarded* (published-port DNAT → container, and
  inter-container over the bridge), so the drop kills it. It worked right after `make
  deploy` (Docker's runtime rules coexisted) but after a reboot nftables loaded our
  default-deny *before* Docker, breaking WAN→Caddy and Caddy→coordinator → the public
  services and the mesh went down. The `docker_host` "`nftables.d` container-forward rules"
  that would make this Docker-safe are explicitly **pending** (STATUS.md). → the `base`
  firewall (`base__firewall_apply`) must NOT be applied to any Docker host until
  `docker_host` ships the container-forward rules; add a guard/check (a Docker host with
  `firewall_apply: true` and no container-forward drop-in is a misconfiguration), and the
  firewall design (ADR-020) should state the Docker-host dependency explicitly.

- `[gotcha]` **`ip_nonlocal_bind` did NOT beat the sshd boot-race** (2026-06-17): the
  mesh-hardening plan bound sshd `ListenAddress` to the `wt0` IP and set
  `net.ipv4.ip_nonlocal_bind=1` so sshd could bind the mesh IP before `wt0` exists at
  boot. In practice the console still showed sshd *"could not assign the address"* at boot
  — so the protection did not work as designed, and because `wt0` never came up (the
  coordinator was down), sshd had no listener at all → no SSH path. → the entire
  "sshd listens on `wt0` only" premise is unsound without (a) a *verified* boot-race fix
  and (b) a guaranteed non-mesh break-glass. Re-investigate why `ip_nonlocal_bind` didn't
  help (ordering vs the sysctl drop-in load? the sysctl not applied before sshd start?),
  or drop ListenAddress-on-mesh entirely and rely on the host firewall for SSH scoping.

- `[gotcha]` **The coordinator host can't bootstrap the mesh it depends on** (2026-06-17):
  `askari` runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird
  agent needs the coordinator (a local container) to be serving to bring up `wt0` — but
  the coordinator wasn't healthy, so `wt0` never came up. Circular. Combined with sshd
  being `wt0`-only, the host was reachable only via the Hetzner console. → the
  coordinator host must keep a **non-mesh management path always** (don't move its SSH onto
  `wt0`), or the mesh-hardening must treat the coordinator host as a special case. General
  rule: never make a host's only management path depend on a service that host itself
  hosts.

- `[gotcha]` **NetBird `netbird-server` FATAL-loops on the geolocation DB download with no
  egress** (2026-06-17): on startup the combined `netbird-server:0.72.4` tries to download
  the GeoLite2 DB from `pkgs.netbird.io` and treats failure as **FATAL** (crash-loop) — so
  any loss of container egress (here: Docker NAT masquerade wiped when `nftables` was
  flushed, not re-added by a plain `restart docker`) takes the whole control plane down.
  Recovery was `restart docker` (rebuild NAT) → force-recreate the container so it could
  download. → for the `netbird_coordinator` role: pre-seed/persist the geo DB in the data
  dir (or pin a local copy), or disable the geolocation requirement, so a transient egress
  blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT)
  is fragile across `nft flush` + reboot.

- `[friction]` **No off-site coordinator backup turned a 2-minute restore into a long live
  recovery** (2026-06-17): the NetBird coordinator's stateful store (`/var/lib/netbird`,
  encrypted SQLite) has **no off-site backup yet** (ADR-022 `backup` role pending,
  flagged in STATUS as the coordinator's deferred backup). During the incident there was a
  real fear the unclean reboots had corrupted the store, with no restore path. It turned
  out to be a runtime/egress issue, not corruption — but the absence of a backup made the
  whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the
  `netbird_coordinator` store ahead of the rest of the backup role; a recent off-host copy
  would have made "rebuild askari from scratch" a safe option.

- `[friction]` **The plan tested reboot-recovery AFTER removing the break-glass**
  (2026-06-17): the mesh-hardening plan's live cutover closed the WAN `:22` (step 5)
  *before* the reboot-resilience test (step 7), so the one fallback path was gone exactly
  when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for
  lockout-risky cutovers: **validate reboot-recovery while the old access path is still
  open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
  Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.

---

## Kaizen reviews — decisions ledger

Consumed signals and where their resolution now lives. Newest first.

### 2026-06-17

Second `/kaizen` run. 7 signals triaged; all 7 consumed (0 kept open). Two heavier items
(the `rename-incomplete` scan check and the Forgejo registry-login path) were built by
parallel subagents and verified against the diff. **Bias-to-remove note:** one PARK
(the ubongo self-management gap — out-of-phase, already tracked in STATUS) and zero
REMOVE; the rest accreted (migrate/change). None of the open signals were `[unused]`
*tooling*, so there was nothing to delete — the only reductive move available was parking
the out-of-phase build. **Cadence:** healthy — 3 days after the first run, every signal
0–2 days old except the one carried over from 2026-06-14; the "recurring ≥3" nudge in
`scripts/friction-scan.py` didn't fire this pass (all recurrence counts were 1), so the
thresholds need no change.

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| ADRs claim cross-doc reconciliation they didn't perform (06-14) | SYSTEMATIZE | New `rename-incomplete` check in `scripts/repo-scan.py` (+7 tests): when a numbered ADR announces a rename `Old`→`New`, flag any design-doc line where `Old` still appears in present tense (skips the announcing ADR, lines also naming `New`, and historical/negation cues; rejects `ADR-NNN` tokens as terms). 0 findings on the current tree — the Traefik→Caddy ripple edits have landed. Structural cousin of `stale-deferred`; run by `/review-repo`. (Was KEEP-OPEN on 2026-06-14 — now built.) |
| Image push to the Forgejo registry needs an interactive `docker login` (06-15) | SYSTEMATIZE → vault | Vault-backed login path so pushes are agent-completable: `vault.forgejo.registry_token` stub (CHANGEME, operator-minted) + `scripts/registry-login.sh` (reads the token, `docker login --password-stdin`, never echoes it) + `make registry-login` + a prereq note in `docs/runbooks/claude-code-setup.md`. Works once the operator fills the token via `make edit-vault`. |
| Single-file bind mount + atomic rewrite = stale config (06-16) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Single-file bind mount + atomic rewrite = stale config (reload-in-place only)": `template` writes a new inode, a single-file bind mount pins the old one, so an in-container reload reads stale config. Mount the config *directory* for reload-in-place roles; restart-based roles are fine with a single-file mount. |
| `make check` always fails on the first-ever deploy of a compose service role (06-16) | CHANGE | `check_mode: false` on the `state: directory` scaffold tasks in `roles/reverse_proxy` + `roles/netbird_coordinator`, so the base dirs exist under `--check` and the rest of the dry-run (templates + compose) evaluates instead of failing on a missing `project_src`. Inert under converge → Molecule unchanged. |
| Re-asked settled defaults — push + execution mode, in prose (06-17) | CHANGE (exec) + ACCEPTED (push) | Widened `.claude/hooks/guard-execution-mode-menu.sh` to also catch free-form *prose* re-asks of the subagent-vs-inline choice (`"which execution approach?"`, `"subagent vs inline"`, …), not just the literal menu; tested. The push re-ask stays a soft default via the `dont-reask-settled-defaults` memory — a genuine "should I push?" is sometimes legitimate, so it is deliberately not hard-blocked. |
| Docs-only commit tripped the rbw-locked pre-commit guard (06-17) | CHANGE | Root cause was NOT the ansible-lint `files:` scope (innocent) — it was `.claude/hooks/guard-vault-preflight.sh` blocking *every* locked `git commit`. Rewrote it to inspect the staged set (`git diff --cached`, plus `-a`/`--all`) and block only when Ansible content (`^(roles\|playbooks\|inventories)/.*\.ya?ml$`) is staged; docs-/config-only commits are now exempt. Fail-safe to block when unsure. Tested. |
| Agent can't self-manage `ubongo` (the control node it runs on) without operator grants (06-17) | PARK | The knowledge already lives in `STATUS.md` (control-node row: the interim `claude`-key + `sjat` NOPASSWD grants, and **Pending:** the proper `ansible`-user bootstrap) and the `ubongo-self-sufficiency` memory. Out-of-phase — the fix is the control-node bootstrap recipe, a tracked future build. **Resurrection trigger:** when building ubongo's `base` hardening / `ansible`-user bootstrap, fold in key-trusted NOPASSWD self-management so control-node self-management needs no ad-hoc operator grants. |

### 2026-06-14

First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
a `repo-scan.py` check is its own build). **Bias-to-remove note:** zero PARK/REMOVE — none
of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process,
which migrate or archive (knowledge is never deleted).

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → **RESOLVED 2026-06-15** | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral **closed** — root cause was **version skew** (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |

### 2026-06-10

| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a *mechanical* guard, not another note. |
| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was **not** the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so **docs-/config-only commits skip it and need no vault**. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |

**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't
built). The 2026-06-14 block was the **first run of `/kaizen`** itself
(`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both
cleared the backlog and validated the command.