docs: mark M1 applied (STATUS); log item.values + Gandi null-MX gotchas

M1 public_dns applied to wingu.me (purge + SPF/DMARC, idempotent). Friction:
item.values dict-method collision, Gandi null-MX rejection, and the apply=false-
Molecule/data-only-pytest gap that let both bugs reach a live apply.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sjat 2026-06-14 10:58:03 +02:00
parent 76bd1d63fc
commit 993d7885e4
2 changed files with 23 additions and 1 deletions


@@ -28,7 +28,7 @@ _Last reviewed: 2026-06-11._
| Tag standard + enforcement (ADR-019) | Works — `tests/tags.yml` (closed vocabulary) + `scripts/check-tags.py` (run by `make lint`, unit-tested): enforces the tag vocabulary and that each role import in a play's `roles:` block carries its role-name tag. Governs mostly-unbuilt roles, but the linter is live now. Proxmox VM tag convention (`<env>`, group, `managed-by=terraform`) is in the Terraform HCL but unprovisioned. |
| `roles/dev_env/` — interactive developer environment | **Built + applied.** zsh + oh-my-zsh + oh-my-posh, tmux + TPM plugins, neovim; dotfiles deployed via GNU stow (re-derived from V4/fisi per ADR-013). Node.js from a pinned upstream tarball (not Debian's npm). Lint + Molecule (idempotent) green. **Applied to `ubongo`** for users `sjat` + `claude` (verified: zsh login shells, stow-symlinked `.zshrc`/`.tmux.conf` + nvim config, oh-my-zsh, tmux plugins; nvim v0.12.2, oh-my-posh 29.0.1). Run via `playbooks/workstation.yml` against the `control` group (no dedicated `workstations` group yet). |
| `make check` / `make deploy PLAYBOOK=<name>` | **Works.** First end-to-end run (applying `dev_env`) surfaced + fixed latent bugs: Makefile `PLAYBOOK` var collision (binary path vs playbook-name arg) meant the targets never ran; `ansible.cfg` referenced uninstalled community.general callbacks (now built-in `default` + `ansible.posix.profile_tasks`); `acl` package added so Ansible can `become_user` an unprivileged user. The make targets now function — though `site`/`base`/`docker_host` content is still incomplete (see below). |
| `roles/public_dns/` + `playbooks/dns.yml` | **Built — not yet applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (null MX, SPF `-all`, DMARC reject), and the Gandi-defaults purge list are defined + unit-tested (`tests/test_public_dns.py`). The live `make deploy PLAYBOOK=dns` (purge + baseline) is **pending — run on ubongo**. M1 of the roadmap. |
| `roles/public_dns/` + `playbooks/dns.yml` | **Built + applied.** Manages wingu.me at Gandi LiveDNS as code (`community.general.gandi_livedns`, PAT from `vault.gandi.pat`); record data, anti-spoof baseline (SPF `-all` + DMARC reject), and the Gandi-defaults purge are defined + unit-tested (`tests/test_public_dns.py`). **Applied to wingu.me (2026-06-14):** purged Gandi's 13 seeded defaults; zone now holds only the SPF + DMARC TXT records; idempotent re-run clean. No null-MX (Gandi rejects `0 .`) — the MX is removed, so no MX + no apex A = no mail. M1 of the roadmap. |
| `ubongo` — physical control / AI-worker host (ADR-015) | **Built (partial).** Debian 13.5 on a Lenovo M70q (i3-10100T, 16 GB, 256 GB SSD; no disk encryption — accepted risk). Full toolchain installed + pinned to `fisi` (Docker 29.5.3, rbw 1.15.0, Claude Code 2.1.173, ansible-core 2.17.14 + molecule via `make setup`/`make collections`). Repo cloned under a dedicated `claude` user (docker group, no sudo). Vault works via rbw (offline-cache decryption verified). SSH key-only (password + root login disabled). In the production inventory `control` group at 10.20.10.151. **`dev_env` now applied here** (zsh/tmux/nvim for `sjat` + `claude`, via `playbooks/workstation.yml`). Managed as the operator account `sjat` (`group_vars/control` sets `ansible_user: sjat`), not the `ansible` service user `group_vars/all` assumes — ubongo has no bootstrapped `ansible` user. **Pending:** NetBird mesh enrollment (so SSH is LAN-only); full `base` hardening (only the `firewall` concern exists, and it is NOT applied here — applying default-deny with no mesh would lock out inbound SSH on the physical NIC); proper `ansible`-user bootstrap (currently managed as `sjat`); OPNsense DHCP reservation for 10.20.10.151 (MAC `88:a4:c2:e0:ee:da`); Terraform state backup (no TF state yet). |
## Scaffolded but empty — NOT implemented


@@ -21,6 +21,28 @@ earning its keep.
_(append new raw signals here; the next kaizen review consumes them)_
- `[gotcha]` **`item.values` in a loop resolves to the dict's `.values` METHOD, not the
`values` key** (2026-06-14): the `public_dns` role looped over records that have a `values:`
key and used `{{ item.values }}` in the `gandi_livedns` task. Jinja attribute access
resolved `item.values` to the built-in dict method, so Gandi received
`"<built-in method values of dict object at 0x...>"` as the live TXT value: corrupt
**and** non-idempotent (the address changes each run, so the task always reports "changed").
The fix is bracket-indexing: `item['values']` (the same risk applies to any key named
`keys`/`items`/`get`/`update`/...). → convention: in loops, index loop-var keys with
`item['key']`, never `item.key`; consider an ansible-lint guard.
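
The collision in the `item.values` gotcha can be reproduced without Ansible. The sketch below is a simplified pure-Python model of Jinja's documented lookup order (dotted access tries the attribute first, subscript access tries the item first); it is not Jinja's actual implementation. The helper names `dotted`, `bracketed`, and `shadowed_keys` and the sample record are hypothetical, chosen only for illustration. `shadowed_keys` shows one possible shape for a data-level guard.

```python
def dotted(obj, name):
    """Model Jinja's `obj.name`: attribute access first, then item access."""
    try:
        return getattr(obj, name)
    except AttributeError:
        return obj[name]


def bracketed(obj, name):
    """Model Jinja's `obj['name']`: item access first, then attribute access."""
    try:
        return obj[name]
    except (KeyError, IndexError, TypeError):
        return getattr(obj, name)


def shadowed_keys(record):
    """Return the keys that dotted access would resolve to a dict method."""
    return sorted(k for k in record if isinstance(k, str) and hasattr(dict, k))


# Hypothetical record, shaped like the role's loop items (a `values:` key).
record = {"name": "_dmarc", "type": "TXT", "values": ["v=DMARC1; p=reject"]}

# Dotted access finds dict.values (a bound method) before the "values" key.
assert callable(dotted(record, "values"))
# Bracket access returns the data stored under the key.
assert bracketed(record, "values") == ["v=DMARC1; p=reject"]
# Keys that do not collide with a dict attribute behave the same either way.
assert dotted(record, "type") == bracketed(record, "type") == "TXT"
# A data-only test could flag the risky keys up front.
assert shadowed_keys(record) == ["values"]
```

A check along the lines of `shadowed_keys` could run in the existing data-only pytest, but it only flags risky key names. It does not replace rendering the template or a check-mode run against the real API.
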
- `[gotcha]` **Gandi LiveDNS rejects RFC-7505 null-MX `0 .`** (2026-06-14): "invalid
format for MX record." Used "no MX + no apex A" + SPF `-all` + DMARC reject instead.
Minor, but worth a note for any future no-mail domain on Gandi.
- `[recurring]` **apply=false Molecule + data-only pytest leave a real gap for
API/templating roles** (2026-06-14): both the null-MX and the `item.values` bugs sailed
through the spec, BOTH review subagents, the pytest (validates the data file, not the
rendered template), and the Molecule scenario (`apply=false`, so the API tasks never
run) — only the **live `make check`/`deploy`** against the real Gandi API surfaced them.
For roles whose payload is "render data → external API call", the rendered template is
the thing that breaks, and nothing short of a real (or check-mode) API call exercises it.
→ for such roles, treat a check-mode run against the real API as a required gate, not an
optional final step; or build a render-only assertion that materializes the module args.
- `[recurring]` **Execution-mode menu asked AGAIN despite the 2026-06-10 "mechanical
fix"** (2026-06-14): at the M1 (`public_dns`) plan handoff I presented the "1.
Subagent-Driven / 2. Inline Execution — which approach?" menu and asked the user to