chore(kaizen): first /kaizen run — curate 12 friction signals

Dogfood of the new /kaizen command. 11 consumed, 1 kept open.
- SYSTEMATIZE → docs/testing/gotchas.md (apply:{tags} propagation, Molecule
  tag-isolation testing, API/templating render-only gap); CLAUDE.md
  (item['key'] loop convention, TF module required_providers); public_dns
  README (Gandi null-MX workaround).
- CHANGE → extend the Stop hook to also guard the brainstorming spec-review gate
  (verified: blocks the gate, passes meta-discussion).
- SYSTEMATIZE → make new-role scaffolds the access__/backup__ noqa reminder;
  ADR-004 documents the cross-role-naming convention.
- ALREADY-BUILT/ACCEPTED → exec-menu guard verified firing; ADR-023; ADR-024;
  subagent-faithfulness now embodied in the two-stage subagent review.
- KEEP-OPEN → a repo-scan.py check for ADRs that over-claim reconciliation.

Nudge: OVERDUE (13 signals) → ok (1). make lint + 16 friction-scan tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-14 21:46:23 +02:00
parent d1e1e38879
commit 13ae674cc9
7 changed files with 120 additions and 141 deletions

@@ -1,17 +1,20 @@
 #!/usr/bin/env bash
 #
-# Stop guard: block ending the turn when the assistant's final message presents the
-# execution-mode menu. The writing-plans / subagent-driven-development skills script a
-# "Subagent-Driven vs Inline Execution — which approach?" menu at the plan→execution
-# handoff. boma's standing preference (docs/FRICTION.md + the
-# always-subagent-driven-execution memory) is to NEVER present it and proceed
-# subagent-driven. Prose reminders failed four times (06-05/06/09/10); this is the
-# mechanical guard recorded by the 2026-06-10 kaizen review.
+# Stop guard for two external-skill gates that conflict with boma conventions, where
+# prose reminders repeatedly failed to hold (docs/FRICTION.md):
+#
+# 1. The execution-mode menu — writing-plans / subagent-driven-development script a
+# "Subagent-Driven vs Inline Execution — which approach?" menu at the plan→execution
+# handoff. boma's standing preference is to NEVER present it and proceed
+# subagent-driven. (Recorded by the 2026-06-10 kaizen review.)
+# 2. The brainstorming spec-review gate — the brainstorming skill scripts "Spec written
+# and committed … please review it before … the implementation plan." The standing
+# agreement is to move directly from the committed spec to writing-plans. (Recorded
+# by the 2026-06-14 kaizen review; 06-10/06-14 recurrences.)
 #
 # Fails OPEN: any parse/read problem → allow the stop. Respects stop_hook_active so a
-# block can never loop. The match signature is deliberately tight ("inline execution"
-# AND "which approach"/"two execution options") so it fires on the actual menu, not on
-# meta-discussion of it.
+# block can never loop. Match signatures are deliberately tight so they fire on the
+# actual gate text, not on meta-discussion of it.
 #
 set -uo pipefail
@@ -43,4 +46,16 @@ JSON
 exit 0
 fi
+# Brainstorming spec-review gate: asking the user to review the committed spec before
+# the implementation plan. Tight signature: "implementation plan" present, plus either the
+# skill's literal "spec written and committed" line, or the review+spec+before combination.
+if [[ "$low" == *"implementation plan"* \
+&& ( "$low" == *"spec written and committed"* \
+|| ( "$low" == *"review"* && "$low" == *"the spec"* && "$low" == *"before"* ) ) ]]; then
+cat <<'JSON'
+{"decision":"block","reason":"Brainstorming spec-review gate detected in your final message. boma standing agreement (docs/FRICTION.md): once the spec is written and committed, move directly to the implementation plan (superpowers:writing-plans) — do not stop to ask the user to review the spec first. Drop the gate and proceed."}
+JSON
+exit 0
+fi
 exit 0
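
The spec-review signature added in the hunk above can be tried against sample messages outside the hook. The following is a minimal sketch, not the hook itself: it restates the same condition with `case` so it also runs under plain `sh`. The function names `contains` and `is_spec_review_gate`, the sample messages, and the use of `tr` to build the lowercased text are all assumptions for illustration; how the real script derives `$low` is not shown in this diff.

```shell
#!/bin/sh
# Sketch only: re-states the spec-review signature in portable `case` form.

# contains HAYSTACK NEEDLE: succeeds when NEEDLE occurs in HAYSTACK.
contains() {
  case "$1" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# is_spec_review_gate TEXT (hypothetical name): succeeds when TEXT would trip
# the signature. Lowercasing via tr is an assumption about how $low is built.
is_spec_review_gate() {
  low=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  contains "$low" "implementation plan" || return 1
  contains "$low" "spec written and committed" && return 0
  contains "$low" "review" && contains "$low" "the spec" && contains "$low" "before"
}

# Invented sample messages, not captured from a real session.
gate='Spec written and committed. Please review it before we move on to the implementation plan.'
meta='Extended the Stop hook so it also guards the brainstorming gate.'

if is_spec_review_gate "$gate"; then gate_verdict=block; else gate_verdict=allow; fi
if is_spec_review_gate "$meta"; then meta_verdict=block; else meta_verdict=allow; fi
printf 'gate: %s\nmeta: %s\n' "$gate_verdict" "$meta_verdict"
```

Because the match is purely substring-based, a meta-discussion that happens to quote every trigger phrase would also be blocked; the sketch makes it easy to probe such wordings before relying on the guard.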

@@ -62,6 +62,9 @@ Full design rationale: `docs/decisions/`
 - **Variables**: `rolename__varname` double-underscore namespace for role defaults
 - **No inline vars in playbooks**: use `group_vars/` or `host_vars/` only
 - **Loops**: prefer `loop:` over `with_items:`
+- **Loop var keys**: index with `item['key']`, never `item.key` — a key named
+`values`/`keys`/`items`/`get`/… resolves to the dict *method* (silently corrupt +
+non-idempotent), not the value
 - **Conditionals**: prefer `true`/`false` over `yes`/`no`
 ---
@@ -178,6 +181,9 @@ Single-contributor, trunk-based (no merge requests / approval gates):
 - Secrets via `TF_VAR_*` env vars only — never in `.tfvars` files
 - `terraform.tfvars.example` is tracked; `terraform.tfvars` is gitignored
 - `.terraform.lock.hcl` is tracked (pins provider versions)
+- Every module declares its own `required_providers` (in `versions.tf`) for any
+non-hashicorp provider — otherwise TF infers `hashicorp/<name>` and `init` fails
+(caught only by a live `tf-init`, not by static review)
 - Full rationale: `docs/decisions/006-terraform.md`
 ---

@@ -181,7 +181,14 @@ endif
 roles/$(NAME)/molecule/default
 echo "---" > roles/$(NAME)/tasks/main.yml
 echo "---" > roles/$(NAME)/handlers/main.yml
-echo "---" > roles/$(NAME)/defaults/main.yml
+printf '%s\n' '---' \
+'# Role defaults use the <rolename>__var double-underscore namespace.' \
+'#' \
+'# Service roles (ADR-004) also declare access__*/backup__* data here. Those are' \
+'# cross-role conventions (not rolename-prefixed), so EACH such line needs a trailing' \
+'# noqa: var-naming[no-role-prefix] (ansible-lint 24.x has no per-prefix allowlist).' \
+'# Reference: roles/reverse_proxy/defaults/main.yml' \
+> roles/$(NAME)/defaults/main.yml
 echo "---" > roles/$(NAME)/meta/main.yml
 printf '# %s\n\nRole description here.\n' "$(NAME)" > roles/$(NAME)/README.md
 cp .scaffold/molecule.yml roles/$(NAME)/molecule/default/molecule.yml

@@ -1,14 +1,15 @@
 # FRICTION.md — kaizen friction log
-Raw signals for the periodic **kaizen review** (the methodology retrospective; see
-`docs/TODO.md`). This is the input that keeps our tooling and conventions sharpening
-over time instead of only accreting.
+Raw signals for the periodic **kaizen review** (`/kaizen`; see `docs/TODO.md` 11). This is
+the input that keeps our tooling and conventions sharpening over time instead of only
+accreting.
 **How to use:** append freely _during_ work under **Open signals** — don't curate,
 don't fix there. Capture friction, surprises, fixes that keep recurring, and tooling
-that isn't earning its keep. The kaizen review reads this, then proposes
-**add / change / remove** (biased toward _remove_), migrates durable knowledge into the
-right docs, and moves consumed signals into the **decisions ledger** below.
+that isn't earning its keep. `/kaizen` reads this, then proposes a verdict per signal
+(SYSTEMATIZE / CHANGE / PARK / REMOVE / ALREADY-BUILT / ACCEPTED / KEEP-OPEN; biased
+toward _remove/park_ for unused tooling), migrates durable knowledge into the right docs,
+and moves consumed signals into the **decisions ledger** below.
 **Entry format:** `date — [tag] observation — (optional) → systematization idea`
 Tags: `[friction]` recurring annoyance · `[gotcha]` surprising behaviour ·
@@ -21,127 +22,6 @@ earning its keep.
 _(append new raw signals here; the next kaizen review consumes them)_
-- `[gotcha]` **Hetzner IPs are 403'd by Google's Go module infra; caddy-dns/gandi DNS-01
-didn't issue** (2026-06-14, M4a): building the custom Caddy image *on askari* failed —
-`proxy.golang.org` and `golang.org` both return **403 Forbidden** to the Hetzner IP
-(worked on ubongo). Reworked the role to build on the control node + `docker save`/`load`
-to the target. *Then* the `caddy-dns/gandi` DNS-01 plugin would not create the
-`_acme-challenge` TXT despite a token verified to (a) be in Caddy's env and (b) create
-TXT records via the Gandi API directly — no plugin error, just "propagation timeout,
-last error <nil>"; resolvers/timeout tuning didn't help. **Resolution:** askari is a
-*public* host, so switched it to **HTTP-01 + vanilla Caddy** (works, drops the custom
-image entirely). DNS-01 deferred to Phase 2 (cluster's mesh/LAN-only services) — the
-plugin + the Hetzner-build-block to be solved then. → lesson: prefer HTTP-01 wherever a
-host is publicly reachable; reserve DNS-01 (and its plugin/build complexity) for hosts
-that genuinely can't do HTTP-01. Both bugs surfaced only on the live host.
-- `[gotcha]` **A tag on `include_tasks` does NOT reach the included tasks — need
-`apply: {tags:}`** (2026-06-14): M3's `base/tasks/main.yml` tagged the ssh/fail2ban
-`include_tasks` with `hardening`, but `make deploy … TAGS=hardening` ran *nothing*
-(`ok=3 changed=0`) — a tag on a dynamic include selects the include, not its contents.
-Fix: `include_tasks: {file: x.yml, apply: {tags: [hardening]}}`. The same latent bug sat
-in the firewall include (never hit — firewall was only ever run untagged). Also the
-check-mode artifact: a `service`/handler for a not-yet-installed package fails in a
-first-run `--check` → guard with `when: not ansible_check_mode`. Both caught only by the
-**live `make check`/`deploy` on askari** — Molecule converges *untagged*, so it can't
-catch tag-propagation. 3rd reinforcement (after M1 `item.values`, M2 TF
-`required_providers`) that live execution catches what review + container tests miss.
-→ when a role uses tags to apply concern-subsets, `apply:` is mandatory on its includes;
-consider an ansible-lint/CI check that `make deploy … TAGS=<concern>` actually changes things.
-- `[gotcha]` **Terraform child modules need their own `required_providers` for
-non-hashicorp providers** (2026-06-14): `terraform init` for the `offsite` env failed —
-the `hetzner_vm` module used `hcloud_*` resources with no `required_providers` block, so
-TF inferred `hashicorp/hcloud` (nonexistent). The `proxmox_vm` module had the **identical
-latent bug**, never caught because Proxmox TF was never `init`ed. Both the terraform-MCP
-schema check and the final review subagent missed it; only `make tf-init/plan` on ubongo
-caught it. Reinforces the M1 signal that **live/real execution catches what static review
-can't** — now for Terraform. → always give a TF module its own `versions.tf` with
-`required_providers`; treat "reviewed but never run" as a structural blind spot.
-- `[gotcha]` **`item.values` in a loop sends the dict's `.values()` METHOD, not the
-key** (2026-06-14): the `public_dns` role looped over records that have a `values:`
-key and used `{{ item.values }}` in the `gandi_livedns` task. Jinja attribute access
-resolved `item.values` to the built-in dict method, so Gandi received
-`"<built-in method values of dict object at 0x...>"` as the live TXT value — corrupt
-**and** non-idempotent (the address changes each run → always "changed"). The fix is
-bracket-indexing: `item['values']` (same risk for any key named `keys`/`items`/`get`/
-`update`/...). → convention: in loops, index loop-var keys with `item['key']`, never
-`item.key`; consider an ansible-lint guard.
-- `[gotcha]` **Gandi LiveDNS rejects RFC-7505 null-MX `0 .`** (2026-06-14): "invalid
-format for MX record." Used "no MX + no apex A" + SPF `-all` + DMARC reject instead.
-Minor, but worth a note for any future no-mail domain on Gandi.
-- `[recurring]` **apply=false Molecule + data-only pytest leave a real gap for
-API/templating roles** (2026-06-14): both the null-MX and the `item.values` bugs sailed
-through the spec, BOTH review subagents, the pytest (validates the data file, not the
-rendered template), and the Molecule scenario (`apply=false`, so the API tasks never
-run) — only the **live `make check`/`deploy`** against the real Gandi API surfaced them.
-For roles whose payload is "render data → external API call", the rendered template is
-the thing that breaks, and nothing short of a real (or check-mode) API call exercises it.
-→ for such roles, treat a check-mode run against the real API as a required gate, not an
-optional final step; or build a render-only assertion that materializes the module args.
-- `[recurring]` **Execution-mode menu asked AGAIN despite the 2026-06-10 "mechanical
-fix"** (2026-06-14): at the M1 (`public_dns`) plan handoff I presented the "1.
-Subagent-Driven / 2. Inline Execution — which approach?" menu and asked the user to
-pick. The decisions ledger (2026-06-10) records this exact behaviour as CHANGE →
-mechanical: *"Stop hook in `.claude/settings.json` blocks the turn if the menu appears
-and tells me to proceed subagent-driven."* It did not fire — either the hook is absent
-in this clone, its matcher doesn't match the wording the `writing-plans` skill actually
-produces, or it isn't installed/active. The standing agreement is to **default straight
-to subagent-driven without asking**. → verify the Stop hook exists and that its pattern
-matches the real menu text (the skill scripts "Two execution options" / "Which
-approach?"); if it relies on `.claude/settings.json` hooks that aren't active here,
-that's the gap. 5th occurrence (06-05/06/09/10/14).
-- `[friction]` **ADR-writing policy is unsettled** (2026-05-31): drafting an ADR, I
-invented a Status header ("Proposed") on the fly because there's no documented
-convention for how we write ADRs (status lifecycle, required sections). → TODO 10.2 —
-decide a minimal ADR template / status convention.
-- `[recurring]` **Brainstorming's "user reviews spec" gate fires despite a standing
-agreement to skip it** (2026-06-10): writing the ADR-structure spec, I stopped to ask
-the user to review the finished spec before writing the plan — the
-`superpowers:brainstorming` skill scripts that gate. We had previously agreed I should
-move directly from the Q/A to the implementation plan once the spec is written. Same
-shape as the execution-mode-menu signal: an external skill's script conflicting with a
-boma convention, where prose reminders don't hold. → consider a mechanical guard
-(Stop-hook family) or a CLAUDE.md/skill-override note that suppresses the spec-review
-gate.
-- `[recurring]` **Subagent faithfulness self-reports can be wrong — controller must
-diff** (2026-06-10): during the ADR-023 retroactive restructure, an implementer
-subagent reported "0 substantive deletions, the See-also lines reappear verbatim" for
-ADR-014, but it had actually dropped the cross-reference lines. Caught only by the
-controller independently running `git show <sha> | grep '^-[^-]'`. For
-faithfulness-critical edits delegated to subagents, the agent's own audit is not
-sufficient evidence. → systematize a controller-side deletion-audit step (every `-`
-line must be a classified, expected change) before accepting any "presentational-only"
-restructure; consider a helper script.
-- `[friction]` **ansible-lint `var-naming[no-role-prefix]` rejects the ADR-021/022
-`access__*`/`backup__*` cross-role field names** (2026-06-14): building the first
-service role's records (`reverse_proxy`), adding the ADR-mandated `access__*` /
-`backup__*` data to `defaults/main.yml` failed lint — the rule requires every role var
-to start with `<rolename>_`, and ansible-lint 24.x has **no per-prefix allowlist**. The
-double-underscore `reverse_proxy__*` namespace passes (starts with `reverse_proxy_`),
-but the deliberately shared `access__`/`backup__` names don't. Resolved with inline
-`# noqa: var-naming[no-role-prefix]` per var (keeps the rule enforced elsewhere). This
-**will recur in every service role**. → decide a project-wide policy before the next
-service role: a documented `.ansible-lint` stance, a sanctioned noqa snippet baked into
-the `make new-role` scaffold, or reconcile the convention. First collision because
-`reverse_proxy` is the first built service role.
-- `[gotcha]` **Molecule CAN exercise tag-propagation, but only with a tagged converge +
-full-then-partial sequencing** (2026-06-14): closing part of the 2026-06-14 `apply:
-{tags:}` signal ("Molecule converges untagged, so it can't catch tag-propagation"). Added
-a second converge play (`include_role` with `apply: {tags: [config]}` + a fresh user)
-and an assertion, then proved the fix with `molecule converge -- --tags config`. Caveat
-learned the hard way: a partial-tag run on a **fresh** instance fails on cross-concern
-deps (a `config` task needs `git`, installed by the `packages` concern), and untagged
-pre_tasks (test-user creation) get filtered out — so the realistic test is **full
-converge → partial re-run** (idempotent), and harness pre_tasks need `tags: [always]`.
-→ adopt the tagged-converge-play pattern for any role with concern subsets; this is the
-CI check the prior signal asked for, in Molecule rather than `make deploy`.
 - `[recurring]` **ADRs claim cross-doc reconciliation they didn't actually perform**
 (2026-06-14): ADR-024's Status + Consequences asserted "ADR-017 prose that mentioned
 Traefik is updated to read Caddy" — but ADR-008/017/019 + CAPABILITIES still said
@@ -151,6 +31,7 @@ _(append new raw signals here; the next kaizen review consumes them)_
 its promised ripple edits don't). → candidate `repo-scan.py` check: when an ADR's text
 asserts "X is updated to Y" / supersedes a named tool, flag remaining occurrences of the
 old name (or verify the claimed edit landed) — the structural cousin of `stale-deferred`.
+(KEEP-OPEN per the 2026-06-14 `/kaizen` run — it's its own build task.)
 ---
@@ -158,6 +39,28 @@ _(append new raw signals here; the next kaizen review consumes them)_
 Consumed signals and where their resolution now lives. Newest first.
+### 2026-06-14
+First `/kaizen` run (dogfood). 12 signals triaged; 11 consumed, 1 kept open (#13 above —
+a `repo-scan.py` check is its own build). **Bias-to-remove note:** zero PARK/REMOVE — none
+of the open signals were `[unused]` *tooling*; they were all knowledge/gotchas/process,
+which migrate or archive (knowledge is never deleted).
+| Signal (first seen) | Verdict | Resolution / where it lives now |
+|---|---|---|
+| Execution-mode menu asked AGAIN — 5× (06-05→06-14) | ALREADY-BUILT | The 06-10 mechanical guard (`.claude/hooks/guard-execution-mode-menu.sh`, wired in `.claude/settings.json`) is **verified firing** on the real writing-plans menu text (tested 06-14). The 06-14 miss was hook-activation timing (the known "hooks-need-restart" gotcha), not a matcher defect. |
+| Brainstorming spec-review gate fires despite the standing agreement (06-10) | CHANGE → mechanical | Extended the same Stop hook with a tight second matcher (review + "the spec" + "before" + "implementation plan", or the literal "spec written and committed"); tested to block the gate and pass meta-discussion. Same external-skill-script-vs-convention family as the execution menu. |
+| Subagent faithfulness self-reports can be wrong (06-10) | ACCEPTED | The mitigation — independent two-stage review where the reviewer is told "do not trust the report" and reads the actual diff — is now embodied in `superpowers:subagent-driven-development`, used for the `/kaizen` build itself. Revisit if it recurs. |
+| ADR-writing policy unsettled (05-31) | ALREADY-BUILT | ADR-023 (ADR structure & lifecycle) + `docs/decisions/adr-template.md` settle status/sections — both postdate this signal. |
+| Hetzner 403 / caddy-dns DNS-01 didn't issue (06-14) | ALREADY-BUILT | ADR-024's revised Status records the HTTP-01 decision, the DNS-01 deferral to Phase 2, and the Hetzner-build + plugin blocks. |
+| `apply:{tags}` not propagated by dynamic `include_tasks` (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Tags on dynamic `include_tasks` need `apply:`". |
+| Molecule CAN test tag-propagation, via a tagged converge (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "Testing concern-tag isolation in Molecule". |
+| apply=false Molecule + data-pytest gap for API/templating roles (06-14) | SYSTEMATIZE | → `docs/testing/gotchas.md` — "API / templating roles: render-only tests miss the real call". |
+| `item.values` in a loop sends the dict method, not the key (06-14) | SYSTEMATIZE | → CLAUDE.md Ansible conventions ("index loop-var keys with `item['key']`, never `item.key`"). |
+| TF child modules need their own `required_providers` (06-14) | SYSTEMATIZE | → CLAUDE.md Terraform conventions ("every module declares its own `required_providers` in `versions.tf`"). |
+| ansible-lint `var-naming` rejects `access__`/`backup__` cross-role names (06-14) | SYSTEMATIZE | → `make new-role` scaffolds a noqa reminder in `defaults/main.yml`; ADR-004's service-role section documents the convention; `roles/reverse_proxy/defaults/main.yml` is the reference. |
+| Gandi rejects RFC-7505 null-MX `0 .` (06-14) | MIGRATE | → `roles/public_dns/README.md` Notes (no MX + SPF `-all` + DMARC reject for a no-mail domain). |
 ### 2026-06-10
 | Signal (first seen) | Verdict | Resolution / where it lives now |
@@ -172,6 +75,7 @@ Consumed signals and where their resolution now lives. Newest first.
 | hooks-need-restart, pre-commit stashes unstaged, `rbw sync` stale cache, zsh word-split (05-30) | MIGRATE | → `docs/runbooks/claude-code-setup.md` "Environment gotchas". |
 | `finishing-a-development-branch` offers open-a-PR vs our trunk-based merge (06-01) | accepted | Same root cause as the menu ask (external skill script vs boma convention). CLAUDE.md already mandates trunk-based merge-to-main; covered by the Stop-hook family + awareness. Revisit if it recurs. |
-**Process note:** the `/retro` tool (TODO 11) still isn't built, so this review was
-manual. Curating by hand (migrate durable knowledge → docs, archive consumed signals →
-this ledger) worked well; fold that curation step into `/retro` when it's built.
+**Process note:** the 2026-06-10 review was manual (the `/retro`/`/kaizen` tool wasn't
+built). The 2026-06-14 block was the **first run of `/kaizen`** itself
+(`scripts/friction-scan.py` Phase 0 + `.claude/commands/kaizen.md`); the dogfood both
+cleared the backlog and validated the command.

@@ -51,6 +51,13 @@ below). Each service role contains a standard set of files:
 | `BACKUP.md` | Per-service backup record — see ADR-022 and `docs/backup/service-backup-template.md` (a stateless service declares `backup__state: false` with a reason) |
 | `meta/main.yml`, `molecule/default/` | Metadata + Debian 13 test scenario |
+The `access__*` (ADR-021) and `backup__*` (ADR-022) data in `defaults/main.yml` are
+**cross-role conventions** — shared field names that deliberately do *not* carry the
+`<rolename>__` prefix. ansible-lint's `var-naming[no-role-prefix]` has no per-prefix
+allowlist, so each such line carries a trailing `# noqa: var-naming[no-role-prefix]` (the
+rule stays enforced for genuinely role-scoped vars). `make new-role` scaffolds a reminder;
+`roles/reverse_proxy/defaults/main.yml` is the reference.
 ### Standard deploy mechanics
 Every service role's `tasks/main.yml` follows the same sequence, so all roles are
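
Since the scaffold only leaves a reminder, a quick check can confirm that every `access__*`/`backup__*` line in a defaults file really carries the noqa described in the hunk above. The sketch below is illustrative and not part of this commit: `missing_noqa` is a hypothetical helper, and `example__port` and `access__exposure` are invented variable names (only `backup__state` appears in the ADR text). It assumes the variables are top-level keys starting at column 0.

```shell
#!/bin/sh
# Sketch only: flag access__*/backup__* lines that lack the trailing noqa.

# missing_noqa FILE (hypothetical helper): prints offending lines, if any.
missing_noqa() {
  grep -E '^(access|backup)__[A-Za-z0-9_]+:' "$1" \
    | grep -F -v 'noqa: var-naming[no-role-prefix]'
}

# Sample defaults file with invented values; one cross-role line is missing
# its noqa on purpose.
sample=$(mktemp)
cat > "$sample" <<'YAML'
---
example__port: 8080
access__exposure: internal  # noqa: var-naming[no-role-prefix]
backup__state: false
YAML

offenders=$(missing_noqa "$sample")
rm -f "$sample"
printf '%s\n' "$offenders"
```

Role-prefixed variables such as `example__port` are ignored by design, so the lint rule keeps covering them; only the shared cross-role names are inspected.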

@@ -34,3 +34,39 @@ testing surprise is worth remembering past the session that hit it.
 apply/safety paths Molecule can't exercise, validate out-of-band (a throwaway
 `--privileged` container with its own netns) and treat a final adversarial review as
 **mandatory, not optional**.
+## Tags on dynamic `include_tasks` need `apply:` to reach the included tasks
+- **A tag on a dynamic `include_tasks` selects the include statement, not its contents.**
+Tagging `include_tasks: x.yml` with `concern` and running `--tags concern` runs
+*nothing* (`ok=N changed=0`) unless the included tasks are independently tagged. Use
+`include_tasks: {file: x.yml, apply: {tags: [concern]}}` to propagate the tag onto the
+included tasks — **mandatory** whenever a role uses tags to apply concern-subsets
+(`roles/base/tasks/main.yml` and `roles/dev_env/tasks/main.yml` are the references).
+- **Molecule converges *untagged*, so it cannot catch this by default** — the bug only
+shows under `make deploy … TAGS=<concern>` on a real host (first hit live on askari, M3).
+See the tag-isolation pattern below to catch it in Molecule instead.
+- **Check-mode artifact:** a `service`/handler for a not-yet-installed package fails in a
+first-run `--check`; guard with `when: not ansible_check_mode`.
+## Testing concern-tag isolation in Molecule
+- To catch the tag-propagation bug above *in Molecule*, add a **second converge play**
+that applies one concern to a fresh target — `include_role` with `apply: {tags: [config]}`
+— plus a `verify` assertion that the concern's effect landed. Drive the real partial
+path with `molecule converge -- --tags config`.
+- **Sequence matters:** a partial-tag run on a *fresh* instance fails on cross-concern
+deps (a `config` task may need a binary the `packages` concern installs). The realistic
+test is **full converge → partial `--tags` re-run** (idempotent). Harness `pre_tasks`
+(e.g. test-user creation) must be tagged `always`, or `--tags` filters them out.
+(Pattern proven on `dev_env`, 2026-06-14.)
+## API / templating roles: render-only tests miss the real call
+- For a role whose payload is "render data → external API call" (e.g. `public_dns`
+Gandi LiveDNS), `apply=false` Molecule + data-only pytest exercise the *data file*, not
+the *rendered module args* — so corrupt-template and API-rejection bugs (`item.values`
+resolving to a dict method; Gandi rejecting RFC-7505 null-MX `0 .`) sail through both,
+plus review. Only a real (or `--check`) call against the API surfaces them.
+- → Treat a **check-mode run against the real API as a required gate** for such roles, or
+build a render-only assertion that materializes and inspects the rendered module args.

@@ -28,3 +28,7 @@ Everything else is reached over LAN/mesh and never appears here.
 The zone is reconciled **additively** plus an explicit `absent` list (Gandi seeds 13
 default records on a new `.me`; we purge the unwanted 11 and overwrite MX/SPF with the
 anti-spoof baseline). Full-zone authoritative pruning is a future enhancement (TODO 8.3).
+**Gandi rejects RFC-7505 null-MX (`0 .`)** with "invalid format for MX record" — so a
+no-mail domain can't use the standard null-MX. We instead **remove the MX entirely** (no
+MX + no apex A = no mail) and rely on SPF `-all` + DMARC `reject` to prevent spoofing.