From 13ae674cc921c13de86f3040479f98051f375b91 Mon Sep 17 00:00:00 2001
From: sjat
Date: Sun, 14 Jun 2026 21:46:23 +0200
Subject: [PATCH] =?UTF-8?q?chore(kaizen):=20first=20/kaizen=20run=20?=
 =?UTF-8?q?=E2=80=94=20curate=2012=20friction=20signals?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Dogfood of the new /kaizen command. 11 consumed, 1 kept open.

- SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule
  tag-isolation testing, API/templating render-only gap); CLAUDE.md
  (item['key'] loop convention, TF module required_providers); public_dns
  README (Gandi null-MX workaround).
- CHANGE → extend the Stop hook to also guard the brainstorming spec-review
  gate (verified: blocks the gate, passes meta-discussion).
- SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder;
  ADR-004 documents the cross-role-naming convention.
- ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024;
  subagent-faithfulness now embodied in the two-stage subagent review.
- KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation.

Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .claude/hooks/guard-execution-mode-menu.sh |  35 +++--
 CLAUDE.md                                  |   6 +
 Makefile                                   |   9 +-
 docs/FRICTION.md                           | 164 +++++----------------
 docs/decisions/004-docker-model.md         |   7 +
 docs/testing/gotchas.md                    |  36 +++++
 roles/public_dns/README.md                 |   4 +
 7 files changed, 120 insertions(+), 141 deletions(-)

diff --git a/.claude/hooks/guard-execution-mode-menu.sh b/.claude/hooks/guard-execution-mode-menu.sh
index f9f47c7..cc1d0ae 100755
--- a/.claude/hooks/guard-execution-mode-menu.sh
+++ b/.claude/hooks/guard-execution-mode-menu.sh
@@ -1,17 +1,20 @@
 #!/usr/bin/env bash
 #
-# Stop guard: block ending the turn when the assistant's final message presents the
-# execution-mode menu. The writing-plans / subagent-driven-development skills script a
-# "Subagent-Driven vs Inline Execution — which approach?" menu at the plan→execution
-# handoff. boma's standing preference (docs/FRICTION.md + the
-# always-subagent-driven-execution memory) is to NEVER present it and proceed
-# subagent-driven. Prose reminders failed four times (06-05/06/09/10); this is the
-# mechanical guard recorded by the 2026-06-10 kaizen review.
+# Stop guard for two external-skill gates that conflict with boma conventions, where
+# prose reminders repeatedly failed to hold (docs/FRICTION.md):
+#
+# 1. The execution-mode menu — writing-plans / subagent-driven-development script a
+#    "Subagent-Driven vs Inline Execution — which approach?" menu at the plan→execution
+#    handoff. boma's standing preference is to NEVER present it and proceed
+#    subagent-driven. (Recorded by the 2026-06-10 kaizen review.)
+# 2. The brainstorming spec-review gate — the brainstorming skill scripts "Spec written
+#    and committed … please review it before … the implementation plan." The standing
+#    agreement is to move directly from the committed spec to writing-plans. (Recorded
+#    by the 2026-06-14 kaizen review; 06-10/06-14 recurrences.)
 #
 # Fails OPEN: any parse/read problem → allow the stop. Respects stop_hook_active so a
-# block can never loop. The match signature is deliberately tight ("inline execution"
-# AND "which approach"/"two execution options") so it fires on the actual menu, not on
-# meta-discussion of it.
+# block can never loop. Match signatures are deliberately tight so they fire on the
+# actual gate text, not on meta-discussion of it.
 #
 set -uo pipefail
 
@@ -43,4 +46,16 @@ JSON
   exit 0
 fi
 
+# Brainstorming spec-review gate: asking the user to review the committed spec before
+# the implementation plan. Tight signature: "implementation plan" present, plus either the
+# skill's literal "spec written and committed" line, or the review+spec+before combination.
+if [[ "$low" == *"implementation plan"* \
+   && ( "$low" == *"spec written and committed"* \
+        || ( "$low" == *"review"* && "$low" == *"the spec"* && "$low" == *"before"* ) ) ]]; then
+  cat <<'JSON'
+{"decision":"block","reason":"Brainstorming spec-review gate detected in your final message. boma standing agreement (docs/FRICTION.md): once the spec is written and committed, move directly to the implementation plan (superpowers:writing-plans) — do not stop to ask the user to review the spec first. Drop the gate and proceed."}
+JSON
+  exit 0
+fi
+
 exit 0
diff --git a/CLAUDE.md b/CLAUDE.md
index 5616225..e8a0640 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -62,6 +62,9 @@ Full design rationale: `docs/decisions/`
 - **Variables**: `rolename__varname` double-underscore namespace for role defaults
 - **No inline vars in playbooks**: use `group_vars/` or `host_vars/` only
 - **Loops**: prefer `loop:` over `with_items:`
+- **Loop var keys**: index with `item['key']`, never `item.key` — a key named
+  `values`/`keys`/`items`/`get`/… resolves to the dict *method* (silently corrupt +
+  non-idempotent), not the value
 - **Conditionals**: prefer `true`/`false` over `yes`/`no`
 
 ---
@@ -178,6 +181,9 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 - Secrets via `TF_VAR_*` env vars only — never in `.tfvars` files
 - `terraform.tfvars.example` is tracked; `terraform.tfvars` is gitignored
 - `.terraform.lock.hcl` is tracked (pins provider versions)
+- Every module declares its own `required_providers` (in `versions.tf`) for any
+  non-hashicorp provider — otherwise TF infers `hashicorp/` and `init` fails
+  (caught only by a live `tf-init`, not by static review)
 - Full rationale: `docs/decisions/006-terraform.md`
 
 ---
diff --git a/Makefile b/Makefile
index 7554245..c3bdcb6 100644
--- a/Makefile
+++ b/Makefile
@@ -181,7 +181,14 @@ endif
 		roles/$(NAME)/molecule/default
 	echo "---" > roles/$(NAME)/tasks/main.yml
 	echo "---" > roles/$(NAME)/handlers/main.yml
-	echo "---" > roles/$(NAME)/defaults/main.yml
+	printf '%s\n' '---' \
+		'# Role defaults use the __var double-underscore namespace.' \
+		'#' \
+		'# Service roles (ADR-004) also declare access__*/backup__* data here. Those are' \
+		'# cross-role conventions (not rolename-prefixed), so EACH such line needs a trailing' \
+		'# noqa: var-naming[no-role-prefix] (ansible-lint 24.x has no per-prefix allowlist).' \
+		'# Reference: roles/reverse_proxy/defaults/main.yml' \
+		> roles/$(NAME)/defaults/main.yml
 	echo "---" > roles/$(NAME)/meta/main.yml
 	printf '# %s\n\nRole description here.\n' "$(NAME)" > roles/$(NAME)/README.md
 	cp .scaffold/molecule.yml roles/$(NAME)/molecule/default/molecule.yml
diff --git a/docs/FRICTION.md b/docs/FRICTION.md
index 0fae4f9..be2ceed 100644
--- a/docs/FRICTION.md
+++ b/docs/FRICTION.md
@@ -1,14 +1,15 @@
 # FRICTION.md — kaizen friction log
 
-Raw signals for the periodic **kaizen review** (the methodology retrospective; see
-`docs/TODO.md`). This is the input that keeps our tooling and conventions sharpening
-over time instead of only accreting.
+Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is
+the input that keeps our tooling and conventions sharpening over time instead of only
+accreting.
 
 **How to use:** append freely _during_ work under **Open signals** — don't curate, don't
 fix there. Capture friction, surprises, fixes that keep recurring, and tooling
-that isn't earning its keep. The kaizen review reads this, then proposes
-**add / change / remove** (biased toward _remove_), migrates durable knowledge into the
-right docs, and moves consumed signals into the **decisions ledger** below.
+that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal
+(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
+toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs,
+and moves consumed signals into the **decisions ledger** below.
 
 **Entry format:** `date — [tag] observation — (optional) → systematization idea`
 Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
@@ -21,127 +22,6 @@ earning its keep.
 
 _(append new raw signals here; the next kaizen review consumes them)_
 
-- `[gotcha]` **Hetzner IPs are 403'd by Google's Go module infra; caddy-dns/gandi DNS-01
-  didn't issue** (2026-06-14, M4a): building the custom Caddy image *on askari* failed —
-  `proxy.golang.org` and `golang.org` both return **403 Forbidden** to the Hetzner IP
-  (worked on ubongo). Reworked the role to build on the control node + `docker save`/`load`
-  to the target. *Then* the `caddy-dns/gandi` DNS-01 plugin would not create the
-  `_acme-challenge` TXT despite a token verified to (a) be in Caddy's env and (b) create
-  TXT records via the Gandi API directly — no plugin error, just "propagation timeout,
-  last error "; resolvers/timeout tuning didn't help. **Resolution:** askari is a
-  *public* host, so switched it to **HTTP-01 + vanilla Caddy** (works, drops the custom
-  image entirely). DNS-01 deferred to Phase 2 (cluster's mesh/LAN-only services) — the
-  plugin + the Hetzner-build-block to be solved then. → lesson: prefer HTTP-01 wherever a
-  host is publicly reachable; reserve DNS-01 (and its plugin/build complexity) for hosts
-  that genuinely can't do HTTP-01. Both bugs surfaced only on the live host.
-
-- `[gotcha]` **A tag on `include_tasks` does NOT reach the included tasks — need
-  `apply: {tags:}`** (2026-06-14): M3's `base/tasks/main.yml` tagged the ssh/fail2ban
-  `include_tasks` with `hardening`, but `make deploy … TAGS=hardening` ran *nothing*
-  (`ok=3 changed=0`) — a tag on a dynamic include selects the include, not its contents.
-  Fix: `include_tasks: {file: x.yml, apply: {tags: [hardening]}}`. The same latent bug sat
-  in the firewall include (never hit — firewall was only ever run untagged). Also the
-  check-mode artifact: a `service`/handler for a not-yet-installed package fails in a
-  first-run `--check` → guard with `when: not ansible_check_mode`. Both caught only by the
-  **live `make check`/`deploy` on askari** — Molecule converges *untagged*, so it can't
-  catch tag-propagation. 3rd reinforcement (after M1 `item.values`, M2 TF
-  `required_providers`) that live execution catches what review + container tests miss.
-  → when a role uses tags to apply concern-subsets, `apply:` is mandatory on its includes;
-  consider an ansible-lint/CI check that `make deploy … TAGS=` actually changes things.
-
-- `[gotcha]` **Terraform child modules need their own `required_providers` for
-  non-hashicorp providers** (2026-06-14): `terraform init` for the `offsite` env failed —
-  the `hetzner_vm` module used `hcloud_*` resources with no `required_providers` block, so
-  TF inferred `hashicorp/hcloud` (nonexistent). The `proxmox_vm` module had the **identical
-  latent bug**, never caught because Proxmox TF was never `init`ed. Both the terraform-MCP
-  schema check and the final review subagent missed it; only `make tf-init/plan` on ubongo
-  caught it. Reinforces the M1 signal that **live/real execution catches what static review
-  can't** — now for Terraform. → always give a TF module its own `versions.tf` with
-  `required_providers`; treat "reviewed but never run" as a structural blind spot.
-
-- `[gotcha]` **`item.values` in a loop sends the dict's `.values()` METHOD, not the
-  key** (2026-06-14): the `public_dns` role looped over records that have a `values:`
-  key and used `{{ item.values }}` in the `gandi_livedns` task. Jinja attribute access
-  resolved `item.values` to the built-in dict method, so Gandi received
-  `""` as the live TXT value — corrupt
-  **and** non-idempotent (the address changes each run → always "changed"). The fix is
-  bracket-indexing: `item['values']` (same risk for any key named `keys`/`items`/`get`/
-  `update`/...). → convention: in loops, index loop-var keys with `item['key']`, never
-  `item.key`; consider an ansible-lint guard.
-- `[gotcha]` **Gandi LiveDNS rejects RFC-7505 null-MX `0 .`** (2026-06-14): "invalid
-  format for MX record." Used "no MX + no apex A" + SPF `-all` + DMARC reject instead.
-  Minor, but worth a note for any future no-mail domain on Gandi.
-- `[recurring]` **apply=false Molecule + data-only pytest leave a real gap for
-  API/templating roles** (2026-06-14): both the null-MX and the `item.values` bugs sailed
-  through the spec, BOTH review subagents, the pytest (validates the data file, not the
-  rendered template), and the Molecule scenario (`apply=false`, so the API tasks never
-  run) — only the **live `make check`/`deploy`** against the real Gandi API surfaced them.
-  For roles whose payload is "render data → external API call", the rendered template is
-  the thing that breaks, and nothing short of a real (or check-mode) API call exercises it.
-  → for such roles, treat a check-mode run against the real API as a required gate, not an
-  optional final step; or build a render-only assertion that materializes the module args.
-
-- `[recurring]` **Execution-mode menu asked AGAIN despite the 2026-06-10 "mechanical
-  fix"** (2026-06-14): at the M1 (`public_dns`) plan handoff I presented the "1.
-  Subagent-Driven / 2. Inline Execution — which approach?" menu and asked the user to
-  pick. The decisions ledger (2026-06-10) records this exact behaviour as CHANGE →
-  mechanical: *"Stop hook in `.claude/settings.json` blocks the turn if the menu appears
-  and tells me to proceed subagent-driven."* It did not fire — either the hook is absent
-  in this clone, its matcher doesn't match the wording the `writing-plans` skill actually
-  produces, or it isn't installed/active. The standing agreement is to **default straight
-  to subagent-driven without asking**. → verify the Stop hook exists and that its pattern
-  matches the real menu text (the skill scripts "Two execution options" / "Which
-  approach?"); if it relies on `.claude/settings.json` hooks that aren't active here,
-  that's the gap. 5th occurrence (06-05/06/09/10/14).
-
-- `[friction]` **ADR-writing policy is unsettled** (2026-05-31): drafting an ADR, I
-  invented a Status header ("Proposed") on the fly because there's no documented
-  convention for how we write ADRs (status lifecycle, required sections). → TODO 10.2 —
-  decide a minimal ADR template / status convention.
-- `[recurring]` **Brainstorming's "user reviews spec" gate fires despite a standing
-  agreement to skip it** (2026-06-10): writing the ADR-structure spec, I stopped to ask
-  the user to review the finished spec before writing the plan — the
-  `superpowers:brainstorming` skill scripts that gate. We had previously agreed I should
-  move directly from the Q/A to the implementation plan once the spec is written. Same
-  shape as the execution-mode-menu signal: an external skill's script conflicting with a
-  boma convention, where prose reminders don't hold. → consider a mechanical guard
-  (Stop-hook family) or a CLAUDE.md/skill-override note that suppresses the spec-review
-  gate.
-- `[recurring]` **Subagent faithfulness self-reports can be wrong — controller must
-  diff** (2026-06-10): during the ADR-023 retroactive restructure, an implementer
-  subagent reported "0 substantive deletions, the See-also lines reappear verbatim" for
-  ADR-014, but it had actually dropped the cross-reference lines. Caught only by the
-  controller independently running `git show | grep '^-[^-]'`. For
-  faithfulness-critical edits delegated to subagents, the agent's own audit is not
-  sufficient evidence. → systematize a controller-side deletion-audit step (every `-`
-  line must be a classified, expected change) before accepting any "presentational-only"
-  restructure; consider a helper script.
-
-- `[friction]` **ansible-lint `var-naming[no-role-prefix]` rejects the ADR-021/022
-  `access__*`/`backup__*` cross-role field names** (2026-06-14): building the first
-  service role's records (`reverse_proxy`), adding the ADR-mandated `access__*` /
-  `backup__*` data to `defaults/main.yml` failed lint — the rule requires every role var
-  to start with `_`, and ansible-lint 24.x has **no per-prefix allowlist**. The
-  double-underscore `reverse_proxy__*` namespace passes (starts with `reverse_proxy_`),
-  but the deliberately shared `access__`/`backup__` names don't. Resolved with inline
-  `# noqa: var-naming[no-role-prefix]` per var (keeps the rule enforced elsewhere). This
-  **will recur in every service role**. → decide a project-wide policy before the next
-  service role: a documented `.ansible-lint` stance, a sanctioned noqa snippet baked into
-  the `make new-role` scaffold, or reconcile the convention. First collision because
-  `reverse_proxy` is the first built service role.
-
-- `[gotcha]` **Molecule CAN exercise tag-propagation, but only with a tagged converge +
-  full-then-partial sequencing** (2026-06-14): closing part of the 2026-06-14 `apply:
-  {tags:}` signal ("Molecule converges untagged, so it can't catch tag-propagation"). Added
-  a second converge play (`include_role` with `apply: {tags: [config]}` + a fresh user)
-  and an assertion, then proved the fix with `molecule converge -- --tags config`. Caveat
-  learned the hard way: a partial-tag run on a **fresh** instance fails on cross-concern
-  deps (a `config` task needs `git`, installed by the `packages` concern), and untagged
-  pre_tasks (test-user creation) get filtered out — so the realistic test is **full
-  converge → partial re-run** (idempotent), and harness pre_tasks need `tags: [always]`.
-  → adopt the tagged-converge-play pattern for any role with concern subsets; this is the
-  CI check the prior signal asked for, in Molecule rather than `make deploy`.
-
 - `[recurring]` **ADRs claim cross-doc reconciliation they didn't actually perform**
   (2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned
   Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said
@@ -151,6 +31,7 @@ _(append new raw signals here; the next kaizen review consumes them)_
   its promised ripple edits don't). → candidate `repo-scan.py` check: when an ADR's text
   asserts "X is updated to Y" / supersedes a named tool, flag remaining occurrences of
   the old name (or verify the claimed edit landed) — the structural cousin of `stale-deferred`.
+  (KEEP-OPEN per the 2026-06-14 `/kaizen` run — it's its own build task.)
 
 ---
 
@@ -158,6 +39,28 @@ _(append new raw signals here; the next kaizen review consumes them)_
 
 Consumed signals and where their resolution now lives. Newest first.
 
+### 2026-06-14
+
+First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
+a `repo-scan.py` check is its own build). **Bias-to-remove note:** zero PARK/REMOVE — none
+of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process,
+which migrate or archive (knowledge is never deleted).
+
+| Signal (first seen) | Verdict | Resolution / where it lives now |
+|---|---|---|
+| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
+| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
+| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
+| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
+| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT | ADR-024's revised Status records the HTTP-01 decision, the DNS-01 deferral to Phase 2, and the Hetzner-build + plugin blocks. |
+| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
+| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
+| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
+| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
+| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
+| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
+| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |
+
 ### 2026-06-10
 
 | Signal (first seen) | Verdict | Resolution / where it lives now |
@@ -172,6 +75,7 @@ Consumed signals and where their resolution now lives. Newest first.
 | hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
 | `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
 
-**Process note:** the `/retro` tool (TODO 11) still isn't built, so this review was
-manual. Curating by hand (migrate durable knowledge → docs, archive consumed signals →
-this ledger) worked well; fold that curation step into `/retro` when it's built.
+**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't
+built). The 2026-06-14 block was the **first run of `/kaizen`** itself
+(`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both
+cleared the backlog and validated the command.
diff --git a/docs/decisions/004-docker-model.md b/docs/decisions/004-docker-model.md
index 1fb355d..674425b 100644
--- a/docs/decisions/004-docker-model.md
+++ b/docs/decisions/004-docker-model.md
@@ -51,6 +51,13 @@ below). Each service role contains a standard set of files:
 | `BACKUP.md` | Per-service backup record — see ADR-022 and `docs/backup/service-backup-template.md` (a stateless service declares `backup__state: false` with a reason) |
 | `meta/main.yml`, `molecule/default/` | Metadata + Debian 13 test scenario |
 
+The `access__*` (ADR-021) and `backup__*` (ADR-022) data in `defaults/main.yml` are
+**cross-role conventions** — shared field names that deliberately do *not* carry the
+`__` prefix. ansible-lint's `var-naming[no-role-prefix]` has no per-prefix
+allowlist, so each such line carries a trailing `# noqa: var-naming[no-role-prefix]` (the
+rule stays enforced for genuinely role-scoped vars). `make new-role` scaffolds a reminder;
+`roles/reverse_proxy/defaults/main.yml` is the reference.
+
 ### Standard deploy mechanics
 
 Every service role's `tasks/main.yml` follows the same sequence, so all roles are
diff --git a/docs/testing/gotchas.md b/docs/testing/gotchas.md
index 6d840c1..d9be069 100644
--- a/docs/testing/gotchas.md
+++ b/docs/testing/gotchas.md
@@ -34,3 +34,39 @@ testing surprise is worth remembering past the session that hit it.
   apply/safety paths Molecule can't exercise, validate out-of-band (a throwaway
   `--privileged` container with its own netns) and treat a final adversarial review as
   **mandatory, not optional**.
+
+## Tags on dynamic `include_tasks` need `apply:` to reach the included tasks
+
+- **A tag on a dynamic `include_tasks` selects the include statement, not its contents.**
+  Tagging `include_tasks: x.yml` with `concern` and running `--tags concern` runs
+  *nothing* (`ok=N changed=0`) unless the included tasks are independently tagged. Use
+  `include_tasks: {file: x.yml, apply: {tags: [concern]}}` to propagate the tag onto the
+  included tasks — **mandatory** whenever a role uses tags to apply concern-subsets
+  (`roles/base/tasks/main.yml` and `roles/dev_env/tasks/main.yml` are the references).
+- **Molecule converges *untagged*, so it cannot catch this by default** — the bug only
+  shows under `make deploy … TAGS=` on a real host (first hit live on askari, M3).
+  See the tag-isolation pattern below to catch it in Molecule instead.
+- **Check-mode artifact:** a `service`/handler for a not-yet-installed package fails in a
+  first-run `--check`; guard with `when: not ansible_check_mode`.
+
+## Testing concern-tag isolation in Molecule
+
+- To catch the tag-propagation bug above *in Molecule*, add a **second converge play**
+  that applies one concern to a fresh target — `include_role` with `apply: {tags: [config]}`
+  — plus a `verify` assertion that the concern's effect landed. Drive the real partial
+  path with `molecule converge -- --tags config`.
+- **Sequence matters:** a partial-tag run on a *fresh* instance fails on cross-concern
+  deps (a `config` task may need a binary the `packages` concern installs). The realistic
+  test is **full converge → partial `--tags` re-run** (idempotent). Harness `pre_tasks`
+  (e.g. test-user creation) must be tagged `always`, or `--tags` filters them out.
+  (Pattern proven on `dev_env`, 2026-06-14.)
+
+## API / templating roles: render-only tests miss the real call
+
+- For a role whose payload is "render data → external API call" (e.g. `public_dns` →
+  Gandi LiveDNS), `apply=false` Molecule + data-only pytest exercise the *data file*, not
+  the *rendered module args* — so corrupt-template and API-rejection bugs (`item.values`
+  resolving to a dict method; Gandi rejecting RFC-7505 null-MX `0 .`) sail through both,
+  plus review. Only a real (or `--check`) call against the API surfaces them.
+- → Treat a **check-mode run against the real API as a required gate** for such roles, or
+  build a render-only assertion that materializes and inspects the rendered module args.
diff --git a/roles/public_dns/README.md b/roles/public_dns/README.md
index b23d854..6fd1db9 100644
--- a/roles/public_dns/README.md
+++ b/roles/public_dns/README.md
@@ -28,3 +28,7 @@ Everything else is reached over LAN/mesh and never appears here.
 The zone is reconciled **additively** plus an explicit `absent` list (Gandi seeds 13
 default records on a new `.me`; we purge the unwanted 11 and overwrite MX/SPF with the
 anti-spoof baseline). Full-zone authoritative pruning is a future enhancement (TODO 8.3).
+
+**Gandi rejects RFC-7505 null-MX (`0 .`)** with "invalid format for MX record" — so a
+no-mail domain can't use the standard null-MX. We instead **remove the MX entirely** (no
+MX + no apex A = no mail) and rely on SPF `-all` + DMARC `reject` to prevent spoofing.