boma/docs/FRICTION.md
sjat f821006e9e docs(friction): log 2026-06-14 review+follow-up signals
Three new Open signals: ansible-lint no-role-prefix vs ADR-021/022 access__/
backup__ conventions (first service role); Molecule tag-propagation now testable
via tagged converge + full-then-partial; ADRs over-claiming cross-doc reconciliation
(repo-scan check candidate, cousin of stale-deferred).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 20:28:15 +02:00

177 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# FRICTION.md — kaizen friction log
Raw signals for the periodic **kaizen review** (the methodology retrospective; see
`docs/TODO.md`). This is the input that keeps our tooling and conventions sharpening
over time instead of only accreting.
**How to use:** append freely _during_ work under **Open signals** — don't curate,
don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
that isn't earning its keep. The kaizen review reads this, then proposes
**add / change / remove** (biased toward _remove_), migrates durable knowledge into the
right docs, and moves consumed signals into the **decisions ledger** below.
**Entry format:** `date — [tag] observation — (optional) → systematization idea`
Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
`[recurring]` keeps coming back, should be systematized · `[unused]` tooling not
earning its keep.
---
## Open signals
_(append new raw signals here; the next kaizen review consumes them)_
- `[gotcha]` **Hetzner IPs are 403'd by Google's Go module infra; caddy-dns/gandi DNS-01
didn't issue** (2026-06-14, M4a): building the custom Caddy image *on askari* failed —
`proxy.golang.org` and `golang.org` both return **403 Forbidden** to the Hetzner IP
(worked on ubongo). Reworked the role to build on the control node + `docker save`/`load`
to the target. *Then* the `caddy-dns/gandi` DNS-01 plugin would not create the
`_acme-challenge` TXT despite a token verified to (a) be in Caddy's env and (b) create
TXT records via the Gandi API directly — no plugin error, just "propagation timeout,
last error <nil>"; resolvers/timeout tuning didn't help. **Resolution:** askari is a
*public* host, so switched it to **HTTP-01 + vanilla Caddy** (works, drops the custom
image entirely). DNS-01 deferred to Phase 2 (cluster's mesh/LAN-only services) — the
plugin + the Hetzner-build-block to be solved then. → lesson: prefer HTTP-01 wherever a
host is publicly reachable; reserve DNS-01 (and its plugin/build complexity) for hosts
that genuinely can't do HTTP-01. Both bugs surfaced only on the live host.
- `[gotcha]` **A tag on `include_tasks` does NOT reach the included tasks — need
`apply: {tags:}`** (2026-06-14): M3's `base/tasks/main.yml` tagged the ssh/fail2ban
`include_tasks` with `hardening`, but `make deploy … TAGS=hardening` ran *nothing*
(`ok=3 changed=0`) — a tag on a dynamic include selects the include, not its contents.
Fix: `include_tasks: {file: x.yml, apply: {tags: [hardening]}}`. The same latent bug sat
in the firewall include (never hit — firewall was only ever run untagged). Also the
check-mode artifact: a `service`/handler for a not-yet-installed package fails in a
first-run `--check` → guard with `when: not ansible_check_mode`. Both caught only by the
**live `make check`/`deploy` on askari** — Molecule converges *untagged*, so it can't
catch tag-propagation. 3rd reinforcement (after M1 `item.values`, M2 TF
`required_providers`) that live execution catches what review + container tests miss.
→ when a role uses tags to apply concern-subsets, `apply:` is mandatory on its includes;
consider an ansible-lint/CI check that `make deploy … TAGS=<concern>` actually changes things.
- `[gotcha]` **Terraform child modules need their own `required_providers` for
non-hashicorp providers** (2026-06-14): `terraform init` for the `offsite` env failed —
the `hetzner_vm` module used `hcloud_*` resources with no `required_providers` block, so
TF inferred `hashicorp/hcloud` (nonexistent). The `proxmox_vm` module had the **identical
latent bug**, never caught because Proxmox TF was never `init`ed. Both the terraform-MCP
schema check and the final review subagent missed it; only `make tf-init/plan` on ubongo
caught it. Reinforces the M1 signal that **live/real execution catches what static review
can't** — now for Terraform. → always give a TF module its own `versions.tf` with
`required_providers`; treat "reviewed but never run" as a structural blind spot.
- `[gotcha]` **`item.values` in a loop sends the dict's `.values()` METHOD, not the
key** (2026-06-14): the `public_dns` role looped over records that have a `values:`
key and used `{{ item.values }}` in the `gandi_livedns` task. Jinja attribute access
resolved `item.values` to the built-in dict method, so Gandi received
`"<built-in method values of dict object at 0x...>"` as the live TXT value — corrupt
**and** non-idempotent (the address changes each run → always "changed"). The fix is
bracket-indexing: `item['values']` (same risk for any key named `keys`/`items`/`get`/
`update`/...). → convention: in loops, index loop-var keys with `item['key']`, never
`item.key`; consider an ansible-lint guard.
- `[gotcha]` **Gandi LiveDNS rejects RFC-7505 null-MX `0 .`** (2026-06-14): "invalid
format for MX record." Used "no MX + no apex A" + SPF `-all` + DMARC reject instead.
Minor, but worth a note for any future no-mail domain on Gandi.
- `[recurring]` **apply=false Molecule + data-only pytest leave a real gap for
API/templating roles** (2026-06-14): both the null-MX and the `item.values` bugs sailed
through the spec, BOTH review subagents, the pytest (validates the data file, not the
rendered template), and the Molecule scenario (`apply=false`, so the API tasks never
run) — only the **live `make check`/`deploy`** against the real Gandi API surfaced them.
For roles whose payload is "render data → external API call", the rendered template is
the thing that breaks, and nothing short of a real (or check-mode) API call exercises it.
→ for such roles, treat a check-mode run against the real API as a required gate, not an
optional final step; or build a render-only assertion that materializes the module args.
- `[recurring]` **Execution-mode menu asked AGAIN despite the 2026-06-10 "mechanical
fix"** (2026-06-14): at the M1 (`public_dns`) plan handoff I presented the "1.
Subagent-Driven / 2. Inline Execution — which approach?" menu and asked the user to
pick. The decisions ledger (2026-06-10) records this exact behaviour as CHANGE →
mechanical: *"Stop hook in `.claude/settings.json` blocks the turn if the menu appears
and tells me to proceed subagent-driven."* It did not fire — either the hook is absent
in this clone, its matcher doesn't match the wording the `writing-plans` skill actually
produces, or it isn't installed/active. The standing agreement is to **default straight
to subagent-driven without asking**. → verify the Stop hook exists and that its pattern
matches the real menu text (the skill scripts "Two execution options" / "Which
approach?"); if it relies on `.claude/settings.json` hooks that aren't active here,
that's the gap. 5th occurrence (06-05/06/09/10/14).
- `[friction]` **ADR-writing policy is unsettled** (2026-05-31): drafting an ADR, I
invented a Status header ("Proposed") on the fly because there's no documented
convention for how we write ADRs (status lifecycle, required sections). → TODO 10.2 —
decide a minimal ADR template / status convention.
- `[recurring]` **Brainstorming's "user reviews spec" gate fires despite a standing
agreement to skip it** (2026-06-10): writing the ADR-structure spec, I stopped to ask
the user to review the finished spec before writing the plan — the
`superpowers:brainstorming` skill scripts that gate. We had previously agreed I should
move directly from the Q/A to the implementation plan once the spec is written. Same
shape as the execution-mode-menu signal: an external skill's script conflicting with a
boma convention, where prose reminders don't hold. → consider a mechanical guard
(Stop-hook family) or a CLAUDE.md/skill-override note that suppresses the spec-review
gate.
- `[recurring]` **Subagent faithfulness self-reports can be wrong — controller must
diff** (2026-06-10): during the ADR-023 retroactive restructure, an implementer
subagent reported "0 substantive deletions, the See-also lines reappear verbatim" for
ADR-014, but it had actually dropped the cross-reference lines. Caught only by the
controller independently running `git show <sha> | grep '^-[^-]'`. For
faithfulness-critical edits delegated to subagents, the agent's own audit is not
sufficient evidence. → systematize a controller-side deletion-audit step (every `-`
line must be a classified, expected change) before accepting any "presentational-only"
restructure; consider a helper script.
- `[friction]` **ansible-lint `var-naming[no-role-prefix]` rejects the ADR-021/022
`access__*`/`backup__*` cross-role field names** (2026-06-14): building the first
service role's records (`reverse_proxy`), adding the ADR-mandated `access__*` /
`backup__*` data to `defaults/main.yml` failed lint — the rule requires every role var
to start with `<rolename>_`, and ansible-lint 24.x has **no per-prefix allowlist**. The
double-underscore `reverse_proxy__*` namespace passes (starts with `reverse_proxy_`),
but the deliberately shared `access__`/`backup__` names don't. Resolved with inline
`# noqa: var-naming[no-role-prefix]` per var (keeps the rule enforced elsewhere). This
**will recur in every service role**. → decide a project-wide policy before the next
service role: a documented `.ansible-lint` stance, a sanctioned noqa snippet baked into
the `make new-role` scaffold, or reconcile the convention. First collision because
`reverse_proxy` is the first built service role.
- `[gotcha]` **Molecule CAN exercise tag-propagation, but only with a tagged converge +
full-then-partial sequencing** (2026-06-14): closing part of the 2026-06-14 `apply:
{tags:}` signal ("Molecule converges untagged, so it can't catch tag-propagation"). Added
a second converge play (`include_role` with `apply: {tags: [config]}` + a fresh user)
and an assertion, then proved the fix with `molecule converge -- --tags config`. Caveat
learned the hard way: a partial-tag run on a **fresh** instance fails on cross-concern
deps (a `config` task needs `git`, installed by the `packages` concern), and untagged
pre_tasks (test-user creation) get filtered out — so the realistic test is **full
converge → partial re-run** (idempotent), and harness pre_tasks need `tags: [always]`.
→ adopt the tagged-converge-play pattern for any role with concern subsets; this is the
CI check the prior signal asked for, in Molecule rather than `make deploy`.
- `[recurring]` **ADRs claim cross-doc reconciliation they didn't actually perform**
(2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned
Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said
Traefik; the rename was left half-done across the doc set and the ADR over-claimed its
own follow-through. Surfaced only by a full-repo `grep Traefik` during `/review-repo`.
Same shape as the deferred-decision-goes-stale signal (a decision lands in one place,
its promised ripple edits don't). → candidate `repo-scan.py` check: when an ADR's text
asserts "X is updated to Y" / supersedes a named tool, flag remaining occurrences of the
old name (or verify the claimed edit landed) — the structural cousin of `stale-deferred`.
---
## Kaizen reviews — decisions ledger
Consumed signals and where their resolution now lives. Newest first.
### 2026-06-10
| Signal (first seen) | Verdict | Resolution / where it lives now |
|---|---|---|
| Execution-mode menu asked at plan handoff — 4× (06-05/06/09/10) | CHANGE → mechanical | Stop hook in `.claude/settings.json` blocks the turn if the menu appears and tells me to proceed subagent-driven. Prose reminders (CLAUDE.md, memory, 3 FRICTION entries) had failed four times — the lesson is that a behaviour conflicting with an external skill's script needs a *mechanical* guard, not another note. |
| Every `git commit` needs `rbw` unlock — recurring (05-30) | CHANGE | Root cause was **not** the vault syntax-check (`.ansible-lint` already excludes `vault.yml`); it was ansible-lint auto-loading + decrypting `inventories/production/group_vars/all/vault.yml` via the wired `vault_password_file`. Scoped the pre-commit `ansible-lint` hook (`always_run: false` + `files:` ansible content) so **docs-/config-only commits skip it and need no vault**. Ansible-content commits still need `rbw` (intrinsic to linting vault-backed plays; accepted). |
| `make test` fails when run non-activated — `ansible-config` not found (06-06) | CHANGE | `Makefile` `test`/`test-all` now prepend `$(CURDIR)/.venv/bin` to `PATH`. |
| Molecule image missing from the Forgejo registry (06-06) | already built | `make molecule-image-push` target exists. |
| Deferred decision goes stale across docs — 3× (06-05) | already built | `scripts/repo-scan.py` `open-deferred-item` / `stale-deferred` checks, run by `/review-repo`. |
| `make new-role` brace-expansion fails under dash (05-30) | fixed | Explicit paths in the Makefile target. |
| nft `iif` vs `iifname`, Molecule `ansible_host`, apply-path coverage blind spot, render-`nft -c` pattern (06-06) | MIGRATE | → `docs/testing/gotchas.md` (pointer from ADR-008). |
| hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
| `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
**Process note:** the `/retro` tool (TODO 11) still isn't built, so this review was
manual. Curating by hand (migrate durable knowledge → docs, archive consumed signals →
this ledger) worked well; fold that curation step into `/retro` when it's built.