boma/docs/FRICTION.md
sjat 941141e270 docs(friction): capture 9 signals from the ADR-025 harness shakedown
UEFI-vs-BIOS boot loop, no-sudo diagnosis gap (-> claude sudo decision), qemu
session-vs-system URI, system-qemu home-traversal, directory-inventory phantom
hosts, jinja trim_blocks render trap, empty apt lists on fresh cloud images,
NAT-gateway firewall allow, and the review-vs-hardware coverage lesson.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 16:30:13 +02:00

217 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# FRICTION.md — kaizen friction log
Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is
the input that keeps our tooling and conventions sharpening over time instead of only
accreting.
**How to use:** append freely _during_ work under **Open signals** — don't curate,
don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal
(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs,
and moves consumed signals into the **decisions ledger** below.
**Entry format:** `date — [tag] observation — (optional) → systematization idea`
Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
`[recurring]` keeps coming back, should be systematized · `[unused]` tooling not
earning its keep.
---
## Open signals
_(append new raw signals here; the next kaizen review consumes them)_
<!-- The six below are from the 2026-06-17 mesh-hardening-1/3 incident: applying base's
nftables default-deny + wt0-only sshd to askari (the off-site Docker host that ALSO runs
the NetBird coordinator) took it down on reboot; recovery needed the Hetzner console +
a WAN-SSH break-glass. Spec/plan: docs/superpowers/{specs,plans}/2026-06-17-mesh-hardening-askari-ssh-wt0*. -->
- `[gotcha]` **`base`'s nftables `forward policy drop` breaks Docker hosts on reboot**
(2026-06-17): `base/templates/nftables.conf.j2` sets `chain forward { ... policy drop; }`.
On a Docker host, container traffic is *forwarded* (published-port DNAT → container, and
inter-container over the bridge), so the drop kills it. It worked right after `make
deploy` (Docker's runtime rules coexisted) but after a reboot nftables loaded our
default-deny *before* Docker, breaking WAN→Caddy and Caddy→coordinator → the public
services and the mesh went down. The `docker_host` "`nftables.d` container-forward rules"
that would make this Docker-safe are explicitly **pending** (STATUS.md). → the `base`
firewall (`base__firewall_apply`) must NOT be applied to any Docker host until
`docker_host` ships the container-forward rules; add a guard/check (a Docker host with
`firewall_apply: true` and no container-forward drop-in is a misconfiguration), and the
firewall design (ADR-020) should state the Docker-host dependency explicitly.
- `[gotcha]` **`ip_nonlocal_bind` did NOT beat the sshd boot-race** (2026-06-17): the
mesh-hardening plan bound sshd `ListenAddress` to the `wt0` IP and set
`net.ipv4.ip_nonlocal_bind=1` so sshd could bind the mesh IP before `wt0` exists at
boot. In practice the console still showed sshd *"could not assign the address"* at boot
— so the protection did not work as designed, and because `wt0` never came up (the
coordinator was down), sshd had no listener at all → no SSH path. → the entire
"sshd listens on `wt0` only" premise is unsound without (a) a *verified* boot-race fix
and (b) a guaranteed non-mesh break-glass. Re-investigate why `ip_nonlocal_bind` didn't
help (ordering vs the sysctl drop-in load? the sysctl not applied before sshd start?),
or drop ListenAddress-on-mesh entirely and rely on the host firewall for SSH scoping.
- `[gotcha]` **The coordinator host can't bootstrap the mesh it depends on** (2026-06-17):
`askari` runs the NetBird coordinator AND is a mesh peer. After a reboot its NetBird
agent needs the coordinator (a local container) to be serving to bring up `wt0` — but
the coordinator wasn't healthy, so `wt0` never came up. Circular. Combined with sshd
being `wt0`-only, the host was reachable only via the Hetzner console. → the
coordinator host must keep a **non-mesh management path always** (don't move its SSH onto
`wt0`), or the mesh-hardening must treat the coordinator host as a special case. General
rule: never make a host's only management path depend on a service that host itself
hosts.
- `[gotcha]` **NetBird `netbird-server` FATAL-loops on the geolocation DB download with no
egress** (2026-06-17): on startup the combined `netbird-server:0.72.4` tries to download
the GeoLite2 DB from `pkgs.netbird.io` and treats failure as **FATAL** (crash-loop) — so
any loss of container egress (here: Docker NAT masquerade wiped when `nftables` was
flushed, not re-added by a plain `restart docker`) takes the whole control plane down.
Recovery was `restart docker` (rebuild NAT) → force-recreate the container so it could
download. → for the `netbird_coordinator` role: pre-seed/persist the geo DB in the data
dir (or pin a local copy), or disable the geolocation requirement, so a transient egress
blip can't FATAL the coordinator. Note for the firewall design: container egress (NAT)
is fragile across `nft flush` + reboot.
- `[friction]` **No off-site coordinator backup turned a 2-minute restore into a long live
recovery** (2026-06-17): the NetBird coordinator's stateful store (`/var/lib/netbird`,
encrypted SQLite) has **no off-site backup yet** (ADR-022 `backup` role pending,
flagged in STATUS as the coordinator's deferred backup). During the incident there was a
real fear the unclean reboots had corrupted the store, with no restore path. It turned
out to be a runtime/egress issue, not corruption — but the absence of a backup made the
whole recovery higher-stakes. → prioritise the ADR-022 backup contract for the
`netbird_coordinator` store ahead of the rest of the backup role; a recent off-host copy
would have made "rebuild askari from scratch" a safe option.
- `[friction]` **The plan tested reboot-recovery AFTER removing the break-glass**
(2026-06-17): the mesh-hardening plan's live cutover closed the WAN `:22` (step 5)
*before* the reboot-resilience test (step 7), so the one fallback path was gone exactly
when the reboot exposed the boot-race + Docker-firewall bugs. → sequencing rule for
lockout-risky cutovers: **validate reboot-recovery while the old access path is still
open**, and only retire the break-glass once recovery (incl. a reboot) is proven.
Generalises beyond this milestone — a candidate line in the new-host / hardening runbooks.
<!-- The below are from the 2026-06-18 ADR-025 build: standing up the local-VM integration
harness on ubongo and shaking it down against real KVM (spec/plan in docs/superpowers/). -->
- `[gotcha]` **Debian 13 genericcloud boot-loops under legacy BIOS/SeaBIOS** (2026-06-18):
`virt-install --import` of the genericcloud qcow2 with the default (SeaBIOS) firmware
triple-faults at the real-mode kernel handoff — GRUB loops, no "Decompressing Linux", no
DHCP lease. The symptom (no network) pointed away from the cause (firmware). → boot test
VMs via **UEFI** (`virt-install --boot uefi`; OVMF→efistub).
- `[friction]` **The no-sudo `claude` model blocked diagnosing a failed VM** (2026-06-18):
under ADR-015 `claude` had no sudo, so when the VM wouldn't network there was no way to
introspect it (serial logs are `root:0600`, libguestfs not installed, mounting needs
root). Diagnosis was fully blocked until the operator granted `claude` sudo. → DECISION:
`claude` gets `NOPASSWD:ALL` (reverses ADR-015's "no local sudo"); compensating control
is auditd/Loki attribution (already in ADR-015). Amend ADR-015/ADR-021 + accepted-risks;
codify the sudoers drop-in in Ansible.
- `[gotcha]` **Non-root `virsh`/`virt-install` default to `qemu:///session`** (2026-06-18):
the substrate (NAT net, /dev/kvm) lives on `qemu:///system`. → pin
`LIBVIRT_DEFAULT_URI=qemu:///system` in the driver.
- `[gotcha]` **`qemu:///system` (libvirt-qemu) can't traverse `/home`** (2026-06-18): VM
disk/seed/console under the repo/home failed "Permission denied (search permissions for
/home/claude)". → put per-VM artifacts in a system-readable dir (`/var/lib/boma-integration`,
group libvirt); the inventory (read by ansible as the user) can stay in the repo.
- `[gotcha]` **`ansible-playbook -i <dir>/` parses sibling non-inventory files as INI**
(2026-06-18): pointing `-i` at a run-dir holding a state file + qcow2s made the directory
inventory loader parse the state file as INI → phantom hosts INCLUDING the real `askari`
(with its real vars), breaking the single-host isolation invariant. → point `-i` at the
single `hosts.yml`. Caught by the holistic cross-file review BEFORE any hardware run.
- `[gotcha]` **Jinja `{%- -%}` + ansible `trim_blocks=True` double-strip newlines**
(2026-06-18): a template edit used `{%- -%}`, reviewed by rendering with RAW jinja2
(trim_blocks=False) which looked fine; ansible (trim_blocks=True) then collapsed the
rendered Caddyfile onto single lines → caddy crash-looped on invalid config. → verify
templates with ansible's whitespace (trim_blocks=True), not raw jinja2; prefer plain
`{% %}` at column 0 (the repo's existing style).
- `[gotcha]` **Fresh cloud images have empty apt lists** (2026-06-18): `apt install
nftables` failed "No package matching 'nftables' is available" on a fresh genericcloud
VM whose cloud-init had `package_update: false`. → `package_update: true` AND block on
`cloud-init status --wait` before applying.
- `[gotcha]` **base's default-deny firewall drops SSH to a NAT'd VM unless the gateway is
allowed** (2026-06-18): the driver reaches the VM via the libvirt-NAT gateway
(192.168.150.1). `ct established,related accept` saves the in-flight apply connection,
but a fresh post-reboot SSH is dropped without an explicit allow. → test overlay sets
`base__firewall_control_addr` to the NAT gateway.
- `[recurring]` **Real-hardware shakedown and static review each caught what the other
couldn't** (2026-06-18): the qemu-URI, storage-path, UEFI, apt-list, and caddy-render
bugs ALL surfaced only on a live KVM run; the phantom-host inventory bug surfaced only in
the holistic cross-file review. → for infra this novel, budget for BOTH an adversarial
cross-file review AND a real-hardware run; neither alone would have shipped it working.
---
## Kaizen reviews — decisions ledger
Consumed signals and where their resolution now lives. Newest first.
### 2026-06-17
Second `/kaizen` run. 7 signals triaged; all 7 consumed (0 kept open). Two heavier items
(the `rename-incomplete` scan check and the Forgejo registry-login path) were built by
parallel subagents and verified against the diff. **Bias-to-remove note:** one PARK
(the ubongo self-management gap — out-of-phase, already tracked in STATUS) and zero
REMOVE; the rest accreted (migrate/change). None of the open signals were `[unused]`
*tooling*, so there was nothing to delete — the only reductive move available was parking
the out-of-phase build. **Cadence:** healthy — 3 days after the first run, every signal
02 days old except the one carried over from 2026-06-14; the "recurring ≥3" nudge in
`scripts/friction-scan.py` didn't fire this pass (all recurrence counts were 1), so the
thresholds need no change.
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| ADRs claim cross-doc reconciliation they didn't perform (06-14) | SYSTEMATIZE | New `rename-incomplete` check in `scripts/repo-scan.py` (+7 tests): when a numbered ADR announces a rename `Old`→`New`, flag any design-doc line where `Old` still appears in present tense (skips the announcing ADR, lines also naming `New`, and historical/negation cues; rejects `ADR-NNN` tokens as terms). 0 findings on the current tree — the Traefik→Caddy ripple edits have landed. Structural cousin of `stale-deferred`; run by `/review-repo`. (Was KEEP-OPEN on 2026-06-14 — now built.) |
| Image push to the Forgejo registry needs an interactive `docker login` (06-15) | SYSTEMATIZE → vault | Vault-backed login path so pushes are agent-completable: `vault.forgejo.registry_token` stub (CHANGEME, operator-minted) + `scripts/registry-login.sh` (reads the token, `docker login --password-stdin`, never echoes it) + `make registry-login` + a prereq note in `docs/runbooks/claude-code-setup.md`. Works once the operator fills the token via `make edit-vault`. |
| Single-file bind mount + atomic rewrite = stale config (06-16) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Single-file bind mount + atomic rewrite = stale config (reload-in-place only)": `template` writes a new inode, a single-file bind mount pins the old one, so an in-container reload reads stale config. Mount the config *directory* for reload-in-place roles; restart-based roles are fine with a single-file mount. |
| `make check` always fails on the first-ever deploy of a compose service role (06-16) | CHANGE | `check_mode: false` on the `state: directory` scaffold tasks in `roles/reverse_proxy` + `roles/netbird_coordinator`, so the base dirs exist under `--check` and the rest of the dry-run (templates + compose) evaluates instead of failing on a missing `project_src`. Inert under converge → Molecule unchanged. |
| Re-asked settled defaults — push + execution mode, in prose (06-17) | CHANGE (exec) + ACCEPTED (push) | Widened `.claude/hooks/guard-execution-mode-menu.sh` to also catch free-form *prose* re-asks of the subagent-vs-inline choice (`"which execution approach?"`, `"subagent vs inline"`, …), not just the literal menu; tested. The push re-ask stays a soft default via the `dont-reask-settled-defaults` memory — a genuine "should I push?" is sometimes legitimate, so it is deliberately not hard-blocked. |
| Docs-only commit tripped the rbw-locked pre-commit guard (06-17) | CHANGE | Root cause was NOT the ansible-lint `files:` scope (innocent) — it was `.claude/hooks/guard-vault-preflight.sh` blocking *every* locked `git commit`. Rewrote it to inspect the staged set (`git diff --cached`, plus `-a`/`--all`) and block only when Ansible content (`^(roles\|playbooks\|inventories)/.*\.ya?ml$`) is staged; docs-/config-only commits are now exempt. Fail-safe to block when unsure. Tested. |
| Agent can't self-manage `ubongo` (the control node it runs on) without operator grants (06-17) | PARK | The knowledge already lives in `STATUS.md` (control-node row: the interim `claude`-key + `sjat` NOPASSWD grants, and **Pending:** the proper `ansible`-user bootstrap) and the `ubongo-self-sufficiency` memory. Out-of-phase — the fix is the control-node bootstrap recipe, a tracked future build. **Resurrection trigger:** when building ubongo's `base` hardening / `ansible`-user bootstrap, fold in key-trusted NOPASSWD self-management so control-node self-management needs no ad-hoc operator grants. |
### 2026-06-14
First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
a `repo-scan.py` check is its own build). **Bias-to-remove note:** zero PARK/REMOVE — none
of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process,
which migrate or archive (knowledge is never deleted).
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT → **RESOLVED 2026-06-15** | 06-14: ADR-024 recorded the HTTP-01 decision + DNS-01 deferral. 06-15: deferral **closed** — root cause was **version skew** (pre-Bearer `libdns/gandi` sent Gandi's deprecated `Apikey` header → 403) plus building on a Hetzner IP. Fix: pin caddy-dns/gandi v1.1.0 (Bearer PAT) + build on ubongo. DNS-01 now built + proven (real wildcard cert via LE staging). See ADR-024 Status + STATUS.md + `roles/reverse_proxy`. |
| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |
### 2026-06-10
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a *mechanical* guard, not another note. |
| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was **not** the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so **docs-/config-only commits skip it and need no vault**. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't
built). The 2026-06-14 block was the **first run of `/kaizen`** itself
(`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both
cleared the backlog and validated the command.