chore(kaizen): first /kaizen run — curate 12 friction signals

Dogfood of the new /kaizen command. 11 consumed, 1 kept open. - SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule tag-isolation testing, API/templating render-only gap); CLAUDE.md (item['key'] loop convention, TF module required_providers); public_dns README (Gandi null-MX workaround). - CHANGE → extend the Stop hook to also guard the brainstorming spec-review gate (verified: blocks the gate, passes meta-discussion). - SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder; ADR-004 documents the cross-role-naming convention. - ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024; subagent-faithfulness now embodied in the two-stage subagent review. - KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation. Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 21:46:23 +02:00 · 2026-06-14 21:46:23 +02:00 · 13ae674cc9
commit 13ae674cc9
parent d1e1e38879
7 changed files with 120 additions and 141 deletions
--- a/.claude/hooks/guard-execution-mode-menu.sh
+++ b/.claude/hooks/guard-execution-mode-menu.sh
@ -1,17 +1,20 @@
 #!/usr/bin/env bash
 #
-# Stop guard: block ending the turn when the assistant's final message presents the
-# execution-mode menu. The writing-plans / subagent-driven-development skills script a
-# "Subagent-Driven vs Inline Execution — which approach?" menu at the plan→execution
-# handoff. boma's standing preference (docs/FRICTION.md + the
-# always-subagent-driven-execution memory) is to NEVER present it and proceed
-# subagent-driven. Prose reminders failed four times (06-05/06/09/10); this is the
-# mechanical guard recorded by the 2026-06-10 kaizen review.
+# Stop guard for two external-skill gates that conflict with boma conventions, where
+# prose reminders repeatedly failed to hold (docs/FRICTION.md):
+#
+#   1. The execution-mode menu — writing-plans / subagent-driven-development script a
+#      "Subagent-Driven vs Inline Execution — which approach?" menu at the plan→execution
+#      handoff. boma's standing preference is to NEVER present it and proceed
+#      subagent-driven. (Recorded by the 2026-06-10 kaizen review.)
+#   2. The brainstorming spec-review gate — the brainstorming skill scripts "Spec written
+#      and committed … please review it before … the implementation plan." The standing
+#      agreement is to move directly from the committed spec to writing-plans. (Recorded
+#      by the 2026-06-14 kaizen review; 06-10/06-14 recurrences.)
 #
 # Fails OPEN: any parse/read problem → allow the stop. Respects stop_hook_active so a
-# block can never loop. The match signature is deliberately tight ("inline execution"
-# AND "which approach"/"two execution options") so it fires on the actual menu, not on
-# meta-discussion of it.
+# block can never loop. Match signatures are deliberately tight so they fire on the
+# actual gate text, not on meta-discussion of it.
 #
 set -uo pipefail

@ -43,4 +46,16 @@ JSON
  exit 0
 fi

+# Brainstorming spec-review gate: asking the user to review the committed spec before
+# the implementation plan. Tight signature: "implementation plan" present, plus either the
+# skill's literal "spec written and committed" line, or the review+spec+before combination.
+if [[ "$low" == *"implementation plan"* \
+   && ( "$low" == *"spec written and committed"* \
+        || ( "$low" == *"review"* && "$low" == *"the spec"* && "$low" == *"before"* ) ) ]]; then
+  cat <<'JSON'
+{"decision":"block","reason":"Brainstorming spec-review gate detected in your final message. boma standing agreement (docs/FRICTION.md): once the spec is written and committed, move directly to the implementation plan (superpowers:writing-plans) — do not stop to ask the user to review the spec first. Drop the gate and proceed."}
+JSON
+  exit 0
+fi
+
 exit 0
--- a/CLAUDE.md
+++ b/CLAUDE.md
@ -62,6 +62,9 @@ Full design rationale: `docs/decisions/`
 - **Variables**: `rolename__varname` double-underscore namespace for role defaults
 - **No inline vars in playbooks**: use `group_vars/` or `host_vars/` only
 - **Loops**: prefer `loop:` over `with_items:`
+- **Loop var keys**: index with `item['key']`, never `item.key` — a key named
+  `values`/`keys`/`items`/`get`/… resolves to the dict *method* (silently corrupt +
+  non-idempotent), not the value
 - **Conditionals**: prefer `true`/`false` over `yes`/`no`

 ---
@ -178,6 +181,9 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 - Secrets via `TF_VAR_*` env vars only — never in `.tfvars` files
 - `terraform.tfvars.example` is tracked; `terraform.tfvars` is gitignored
 - `.terraform.lock.hcl` is tracked (pins provider versions)
+- Every module declares its own `required_providers` (in `versions.tf`) for any
+  non-hashicorp provider — otherwise TF infers `hashicorp/<name>` and `init` fails
+  (caught only by a live `tf-init`, not by static review)
 - Full rationale: `docs/decisions/006-terraform.md`

 ---
--- a/9
+++ b/9
@ -181,7 +181,14 @@ endif
 	         roles/$(NAME)/molecule/default
 	echo "---" > roles/$(NAME)/tasks/main.yml
 	echo "---" > roles/$(NAME)/handlers/main.yml
-	echo "---" > roles/$(NAME)/defaults/main.yml
+	printf '%s\n' '---' \
+	  '# Role defaults use the <rolename>__var double-underscore namespace.' \
+	  '#' \
+	  '# Service roles (ADR-004) also declare access__*/backup__* data here. Those are' \
+	  '# cross-role conventions (not rolename-prefixed), so EACH such line needs a trailing' \
+	  '# noqa: var-naming[no-role-prefix]  (ansible-lint 24.x has no per-prefix allowlist).' \
+	  '# Reference: roles/reverse_proxy/defaults/main.yml' \
+	  > roles/$(NAME)/defaults/main.yml
 	echo "---" > roles/$(NAME)/meta/main.yml
 	printf '# %s\n\nRole description here.\n' "$(NAME)" > roles/$(NAME)/README.md
 	cp .scaffold/molecule.yml roles/$(NAME)/molecule/default/molecule.yml
--- a/docs/FRICTION.md
+++ b/docs/FRICTION.md
@ -1,14 +1,15 @@
 # FRICTION.md — kaizen friction log

-Raw signals for the periodic **kaizen review** (the methodology retrospective; see
-`docs/TODO.md`). This is the input that keeps our tooling and conventions sharpening
-over time instead of only accreting.
+Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is
+the input that keeps our tooling and conventions sharpening over time instead of only
+accreting.

 **How to use:** append freely _during_ work under **Open signals** — don't curate,
 don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
-that isn't earning its keep. The kaizen review reads this, then proposes
-**add / change / remove** (biased toward _remove_), migrates durable knowledge into the
-right docs, and moves consumed signals into the **decisions ledger** below.
+that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal
+(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
+toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs,
+and moves consumed signals into the **decisions ledger** below.

 **Entry format:** `date — [tag] observation — (optional) → systematization idea`
 Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
@ -21,127 +22,6 @@ earning its keep.

 _(append new raw signals here; the next kaizen review consumes them)_

- `[gotcha]` **Hetzner IPs are 403'd by Google's Go module infra; caddy-dns/gandi DNS-01
-  didn't issue** (2026-06-14, M4a): building the custom Caddy image *on askari* failed —
-  `proxy.golang.org` and `golang.org` both return **403 Forbidden** to the Hetzner IP
-  (worked on ubongo). Reworked the role to build on the control node + `docker save`/`load`
-  to the target. *Then* the `caddy-dns/gandi` DNS-01 plugin would not create the
-  `_acme-challenge` TXT despite a token verified to (a) be in Caddy's env and (b) create
-  TXT records via the Gandi API directly — no plugin error, just "propagation timeout,
-  last error <nil>"; resolvers/timeout tuning didn't help. **Resolution:** askari is a
-  *public* host, so switched it to **HTTP-01 + vanilla Caddy** (works, drops the custom
-  image entirely). DNS-01 deferred to Phase 2 (cluster's mesh/LAN-only services) — the
-  plugin + the Hetzner-build-block to be solved then. → lesson: prefer HTTP-01 wherever a
-  host is publicly reachable; reserve DNS-01 (and its plugin/build complexity) for hosts
-  that genuinely can't do HTTP-01. Both bugs surfaced only on the live host.
-
- `[gotcha]` **A tag on `include_tasks` does NOT reach the included tasks — need
-  `apply: {tags:}`** (2026-06-14): M3's `base/tasks/main.yml` tagged the ssh/fail2ban
-  `include_tasks` with `hardening`, but `make deploy … TAGS=hardening` ran *nothing*
-  (`ok=3 changed=0`) — a tag on a dynamic include selects the include, not its contents.
-  Fix: `include_tasks: {file: x.yml, apply: {tags: [hardening]}}`. The same latent bug sat
-  in the firewall include (never hit — firewall was only ever run untagged). Also the
-  check-mode artifact: a `service`/handler for a not-yet-installed package fails in a
-  first-run `--check` → guard with `when: not ansible_check_mode`. Both caught only by the
-  **live `make check`/`deploy` on askari** — Molecule converges *untagged*, so it can't
-  catch tag-propagation. 3rd reinforcement (after M1 `item.values`, M2 TF
-  `required_providers`) that live execution catches what review + container tests miss.
-  → when a role uses tags to apply concern-subsets, `apply:` is mandatory on its includes;
-  consider an ansible-lint/CI check that `make deploy … TAGS=<concern>` actually changes things.
-
- `[gotcha]` **Terraform child modules need their own `required_providers` for
-  non-hashicorp providers** (2026-06-14): `terraform init` for the `offsite` env failed —
-  the `hetzner_vm` module used `hcloud_*` resources with no `required_providers` block, so
-  TF inferred `hashicorp/hcloud` (nonexistent). The `proxmox_vm` module had the **identical
-  latent bug**, never caught because Proxmox TF was never `init`ed. Both the terraform-MCP
-  schema check and the final review subagent missed it; only `make tf-init/plan` on ubongo
-  caught it. Reinforces the M1 signal that **live/real execution catches what static review
-  can't** — now for Terraform. → always give a TF module its own `versions.tf` with
-  `required_providers`; treat "reviewed but never run" as a structural blind spot.
-
- `[gotcha]` **`item.values` in a loop sends the dict's `.values()` METHOD, not the
-  key** (2026-06-14): the `public_dns` role looped over records that have a `values:`
-  key and used `{{ item.values }}` in the `gandi_livedns` task. Jinja attribute access
-  resolved `item.values` to the built-in dict method, so Gandi received
-  `"<built-in method values of dict object at 0x...>"` as the live TXT value — corrupt
-  **and** non-idempotent (the address changes each run → always "changed"). The fix is
-  bracket-indexing: `item['values']` (same risk for any key named `keys`/`items`/`get`/
-  `update`/...). → convention: in loops, index loop-var keys with `item['key']`, never
-  `item.key`; consider an ansible-lint guard.
- `[gotcha]` **Gandi LiveDNS rejects RFC-7505 null-MX `0 .`** (2026-06-14): "invalid
-  format for MX record." Used "no MX + no apex A" + SPF `-all` + DMARC reject instead.
-  Minor, but worth a note for any future no-mail domain on Gandi.
- `[recurring]` **apply=false Molecule + data-only pytest leave a real gap for
-  API/templating roles** (2026-06-14): both the null-MX and the `item.values` bugs sailed
-  through the spec, BOTH review subagents, the pytest (validates the data file, not the
-  rendered template), and the Molecule scenario (`apply=false`, so the API tasks never
-  run) — only the **live `make check`/`deploy`** against the real Gandi API surfaced them.
-  For roles whose payload is "render data → external API call", the rendered template is
-  the thing that breaks, and nothing short of a real (or check-mode) API call exercises it.
-  → for such roles, treat a check-mode run against the real API as a required gate, not an
-  optional final step; or build a render-only assertion that materializes the module args.
-
- `[recurring]` **Execution-mode menu asked AGAIN despite the 2026-06-10 "mechanical
-  fix"** (2026-06-14): at the M1 (`public_dns`) plan handoff I presented the "1.
-  Subagent-Driven / 2. Inline Execution — which approach?" menu and asked the user to
-  pick. The decisions ledger (2026-06-10) records this exact behaviour as CHANGE →
-  mechanical: *"Stop hook in `.claude/settings.json` blocks the turn if the menu appears
-  and tells me to proceed subagent-driven."* It did not fire — either the hook is absent
-  in this clone, its matcher doesn't match the wording the `writing-plans` skill actually
-  produces, or it isn't installed/active. The standing agreement is to **default straight
-  to subagent-driven without asking**. → verify the Stop hook exists and that its pattern
-  matches the real menu text (the skill scripts "Two execution options" / "Which
-  approach?"); if it relies on `.claude/settings.json` hooks that aren't active here,
-  that's the gap. 5th occurrence (06-05/06/09/10/14).
-
- `[friction]` **ADR-writing policy is unsettled** (2026-05-31): drafting an ADR, I
-  invented a Status header ("Proposed") on the fly because there's no documented
-  convention for how we write ADRs (status lifecycle, required sections). → TODO 10.2 —
-  decide a minimal ADR template / status convention.
- `[recurring]` **Brainstorming's "user reviews spec" gate fires despite a standing
-  agreement to skip it** (2026-06-10): writing the ADR-structure spec, I stopped to ask
-  the user to review the finished spec before writing the plan — the
-  `superpowers:brainstorming` skill scripts that gate. We had previously agreed I should
-  move directly from the Q/A to the implementation plan once the spec is written. Same
-  shape as the execution-mode-menu signal: an external skill's script conflicting with a
-  boma convention, where prose reminders don't hold. → consider a mechanical guard
-  (Stop-hook family) or a CLAUDE.md/skill-override note that suppresses the spec-review
-  gate.
- `[recurring]` **Subagent faithfulness self-reports can be wrong — controller must
-  diff** (2026-06-10): during the ADR-023 retroactive restructure, an implementer
-  subagent reported "0 substantive deletions, the See-also lines reappear verbatim" for
-  ADR-014, but it had actually dropped the cross-reference lines. Caught only by the
-  controller independently running `git show <sha> | grep '^-[^-]'`. For
-  faithfulness-critical edits delegated to subagents, the agent's own audit is not
-  sufficient evidence. → systematize a controller-side deletion-audit step (every `-`
-  line must be a classified, expected change) before accepting any "presentational-only"
-  restructure; consider a helper script.
-
- `[friction]` **ansible-lint `var-naming[no-role-prefix]` rejects the ADR-021/022
-  `access__*`/`backup__*` cross-role field names** (2026-06-14): building the first
-  service role's records (`reverse_proxy`), adding the ADR-mandated `access__*` /
-  `backup__*` data to `defaults/main.yml` failed lint — the rule requires every role var
-  to start with `<rolename>_`, and ansible-lint 24.x has **no per-prefix allowlist**. The
-  double-underscore `reverse_proxy__*` namespace passes (starts with `reverse_proxy_`),
-  but the deliberately shared `access__`/`backup__` names don't. Resolved with inline
-  `# noqa: var-naming[no-role-prefix]` per var (keeps the rule enforced elsewhere). This
-  **will recur in every service role**. → decide a project-wide policy before the next
-  service role: a documented `.ansible-lint` stance, a sanctioned noqa snippet baked into
-  the `make new-role` scaffold, or reconcile the convention. First collision because
-  `reverse_proxy` is the first built service role.
-
- `[gotcha]` **Molecule CAN exercise tag-propagation, but only with a tagged converge +
-  full-then-partial sequencing** (2026-06-14): closing part of the 2026-06-14 `apply:
-  {tags:}` signal ("Molecule converges untagged, so it can't catch tag-propagation"). Added
-  a second converge play (`include_role` with `apply: {tags: [config]}` + a fresh user)
-  and an assertion, then proved the fix with `molecule converge -- --tags config`. Caveat
-  learned the hard way: a partial-tag run on a **fresh** instance fails on cross-concern
-  deps (a `config` task needs `git`, installed by the `packages` concern), and untagged
-  pre_tasks (test-user creation) get filtered out — so the realistic test is **full
-  converge → partial re-run** (idempotent), and harness pre_tasks need `tags: [always]`.
-  → adopt the tagged-converge-play pattern for any role with concern subsets; this is the
-  CI check the prior signal asked for, in Molecule rather than `make deploy`.
-
 - `[recurring]` **ADRs claim cross-doc reconciliation they didn't actually perform**
  (2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned
  Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said
@ -151,6 +31,7 @@ _(append new raw signals here; the next kaizen review consumes them)_
  its promised ripple edits don't). → candidate `repo-scan.py` check: when an ADR's text
  asserts "X is updated to Y" / supersedes a named tool, flag remaining occurrences of the
  old name (or verify the claimed edit landed) — the structural cousin of `stale-deferred`.
+  (KEEP-OPEN per the 2026-06-14 `/kaizen` run — it's its own build task.)

 ---

@ -158,6 +39,28 @@ _(append new raw signals here; the next kaizen review consumes them)_

 Consumed signals and where their resolution now lives. Newest first.

+### 2026-06-14
+
+First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
+a `repo-scan.py` check is its own build). **Bias-to-remove note:** zero PARK/REMOVE — none
+of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process,
+which migrate or archive (knowledge is never deleted).
+
+| Signal (first seen) | Verdict | Resolution / where it lives now |
+|---|---|---|
+| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
+| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
+| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
+| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
+| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT | ADR-024's revised Status records the HTTP-01 decision, the DNS-01 deferral to Phase 2, and the Hetzner-build + plugin blocks. |
+| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
+| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
+| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
+| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
+| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
+| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
+| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |
+
 ### 2026-06-10

 | Signal (first seen) | Verdict | Resolution / where it lives now |
@ -172,6 +75,7 @@ Consumed signals and where their resolution now lives. Newest first.
 | hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
 | `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |

-**Process note:** the `/retro` tool (TODO 11) still isn't built, so this review was
-manual. Curating by hand (migrate durable knowledge → docs, archive consumed signals →
-this ledger) worked well; fold that curation step into `/retro` when it's built.
+**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't
+built). The 2026-06-14 block was the **first run of `/kaizen`** itself
+(`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both
+cleared the backlog and validated the command.
--- a/docs/decisions/004-docker-model.md
+++ b/docs/decisions/004-docker-model.md
@ -51,6 +51,13 @@ below). Each service role contains a standard set of files:
 | `BACKUP.md` | Per-service backup record — see ADR-022 and `docs/backup/service-backup-template.md` (a stateless service declares `backup__state: false` with a reason) |
 | `meta/main.yml`, `molecule/default/` | Metadata + Debian 13 test scenario |

+The `access__*` (ADR-021) and `backup__*` (ADR-022) data in `defaults/main.yml` are
+**cross-role conventions** — shared field names that deliberately do *not* carry the
+`<rolename>__` prefix. ansible-lint's `var-naming[no-role-prefix]` has no per-prefix
+allowlist, so each such line carries a trailing `# noqa: var-naming[no-role-prefix]` (the
+rule stays enforced for genuinely role-scoped vars). `make new-role` scaffolds a reminder;
+`roles/reverse_proxy/defaults/main.yml` is the reference.
+
 ### Standard deploy mechanics

 Every service role's `tasks/main.yml` follows the same sequence, so all roles are
--- a/docs/testing/gotchas.md
+++ b/docs/testing/gotchas.md
@ -34,3 +34,39 @@ testing surprise is worth remembering past the session that hit it.
  apply/safety paths Molecule can't exercise, validate out-of-band (a throwaway
  `--privileged` container with its own netns) and treat a final adversarial review as
  **mandatory, not optional**.
+
+## Tags on dynamic `include_tasks` need `apply:` to reach the included tasks
+
+- **A tag on a dynamic `include_tasks` selects the include statement, not its contents.**
+  Tagging `include_tasks: x.yml` with `concern` and running `--tags concern` runs
+  *nothing* (`ok=N changed=0`) unless the included tasks are independently tagged. Use
+  `include_tasks: {file: x.yml, apply: {tags: [concern]}}` to propagate the tag onto the
+  included tasks — **mandatory** whenever a role uses tags to apply concern-subsets
+  (`roles/base/tasks/main.yml` and `roles/dev_env/tasks/main.yml` are the references).
+- **Molecule converges *untagged*, so it cannot catch this by default** — the bug only
+  shows under `make deploy … TAGS=<concern>` on a real host (first hit live on askari, M3).
+  See the tag-isolation pattern below to catch it in Molecule instead.
+- **Check-mode artifact:** a `service`/handler for a not-yet-installed package fails in a
+  first-run `--check`; guard with `when: not ansible_check_mode`.
+
+## Testing concern-tag isolation in Molecule
+
+- To catch the tag-propagation bug above *in Molecule*, add a **second converge play**
+  that applies one concern to a fresh target — `include_role` with `apply: {tags: [config]}`
+  — plus a `verify` assertion that the concern's effect landed. Drive the real partial
+  path with `molecule converge -- --tags config`.
+- **Sequence matters:** a partial-tag run on a *fresh* instance fails on cross-concern
+  deps (a `config` task may need a binary the `packages` concern installs). The realistic
+  test is **full converge → partial `--tags` re-run** (idempotent). Harness `pre_tasks`
+  (e.g. test-user creation) must be tagged `always`, or `--tags` filters them out.
+  (Pattern proven on `dev_env`, 2026-06-14.)
+
+## API / templating roles: render-only tests miss the real call
+
+- For a role whose payload is "render data → external API call" (e.g. `public_dns` →
+  Gandi LiveDNS), `apply=false` Molecule + data-only pytest exercise the *data file*, not
+  the *rendered module args* — so corrupt-template and API-rejection bugs (`item.values`
+  resolving to a dict method; Gandi rejecting RFC-7505 null-MX `0 .`) sail through both,
+  plus review. Only a real (or `--check`) call against the API surfaces them.
+- → Treat a **check-mode run against the real API as a required gate** for such roles, or
+  build a render-only assertion that materializes and inspects the rendered module args.
--- a/roles/public_dns/README.md
+++ b/roles/public_dns/README.md
@ -28,3 +28,7 @@ Everything else is reached over LAN/mesh and never appears here.
 The zone is reconciled **additively** plus an explicit `absent` list (Gandi seeds 13
 default records on a new `.me`; we purge the unwanted 11 and overwrite MX/SPF with the
 anti-spoof baseline). Full-zone authoritative pruning is a future enhancement (TODO 8.3).
+
+**Gandi rejects RFC-7505 null-MX (`0 .`)** with "invalid format for MX record" — so a
+no-mail domain can't use the standard null-MX. We instead **remove the MX entirely** (no
+MX + no apex A = no mail) and rely on SPF `-all` + DMARC `reject` to prevent spoofing.