boma/docs/testing/gotchas.md
sjat a0762c563e docs(kaizen): bind-mount gotcha + consume 7 signals into the ledger (2026-06-17)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 17:50:17 +02:00

Testing & Molecule gotchas

Durable, point-of-use knowledge for writing and running role tests (ADR-008). Migrated from docs/FRICTION.md by the 2026-06-10 kaizen review. Append here when a testing surprise is worth remembering past the session that hit it.

nftables / nft -c render checks

  • nft -c rejects iif "<name>" when the interface is absent. iif resolves to an interface index at load time, so it fails in the Molecule container and would fail identically on any real host before the interface exists (e.g. wt0 before NetBird is up). Use iifname "<name>" (string match, no existence requirement, survives the interface coming and going) for any interface that may be absent.
  • The render-and-nft -c (no-apply) Molecule approach earns its keep — it caught the iif/iifname bug deterministically without touching the host kernel. Reuse this pattern (render template → static-check, never apply) for other config-rendering roles.
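A minimal sketch of the render-then-static-check pattern described above, written as Molecule converge tasks. The template name and scratch path are hypothetical; only nft -c -f and the iif/iifname distinction come from the notes.

```yaml
# Sketch only: template name and destination path are assumptions.
# In the template, prefer   iifname "wt0" accept   (string match, wt0 may be absent)
# over                      iif "wt0" accept       (needs wt0 to exist at load time)
- name: Render the ruleset to a scratch path (never the live config)
  ansible.builtin.template:
    src: nftables.conf.j2              # hypothetical template
    dest: /tmp/nftables.rendered.conf  # hypothetical scratch path
    mode: "0600"

- name: Static-check the rendered ruleset without loading it into the kernel
  ansible.builtin.command:
    cmd: nft -c -f /tmp/nftables.rendered.conf
  changed_when: false
```

The check task fails the converge when nft rejects the file, so a regression from iifname back to iif would surface in the container without touching the host.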

Molecule (community.docker)

  • Molecule's community.docker connection uses ansible_host as the container name (remote_addr). Setting ansible_host as data in a scenario's host_vars (e.g. to give a resolver a fake IP) breaks the connection → UNREACHABLE / "Failed to create temporary directory". Don't override ansible_host in Molecule; feed fixture IPs another way (keep fixtures to zone sources and unit-test IP resolution).
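One way to keep fixture data out of the connection variable, sketched as a molecule.yml excerpt. The variable name fixture_ip and the platform name instance are assumptions for illustration, not the project's actual layout.

```yaml
# molecule.yml (excerpt, sketch only)
provisioner:
  name: ansible
  inventory:
    host_vars:
      instance:                  # assumed platform/container name
        # Do not set ansible_host here: the docker connection
        # uses it as the container name.
        fixture_ip: 192.0.2.10   # hypothetical variable, documentation-range IP
```

Any role code under test would then have to read the hypothetical fixture_ip rather than ansible_host, which is why the notes prefer unit-testing IP resolution separately.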

Coverage blind spot: apply-only task paths

  • Apply-only task paths have no Level-1 coverage, so safety bugs hide there. Example: an nft auto-rollback snapshot used a bare nft list ruleset (no leading flush ruleset), so the revert was a silent no-op on first apply and errored on later ones — the whole safety net was dead. Molecule never runs the apply (gated off), so only adversarial review + an isolated-netns round-trip test caught it. → For apply/safety paths Molecule can't exercise, validate out-of-band (a throwaway --privileged container with its own netns) and treat a final adversarial review as mandatory, not optional.
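A sketch of a snapshot task that writes the leading flush ruleset the bullet above says was missing, so that loading the snapshot replaces the live state instead of merging into it. The snapshot path is hypothetical, and this is not the role's actual task.

```yaml
# Sketch only: /run/nft-rollback.nft is an assumed path.
- name: Snapshot the live ruleset, prefixed with a flush
  ansible.builtin.shell:
    cmd: |
      set -eu
      { echo 'flush ruleset'; nft list ruleset; } > /run/nft-rollback.nft
  changed_when: false

# A revert would then load the snapshot as one file:
#   nft -f /run/nft-rollback.nft
```

Because Molecule never runs this path, a snapshot like this still needs the out-of-band round trip (snapshot, apply, revert, compare) in a throwaway network namespace.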

Tags on dynamic include_tasks need apply: to reach the included tasks

  • A tag on a dynamic include_tasks selects the include statement, not its contents. Tagging include_tasks: x.yml with concern and running --tags concern runs nothing (ok=N changed=0) unless the included tasks are independently tagged. Use include_tasks: {file: x.yml, apply: {tags: [concern]}} to propagate the tag onto the included tasks — mandatory whenever a role uses tags to apply concern-subsets (roles/base/tasks/main.yml and roles/dev_env/tasks/main.yml are the references).
  • Molecule converges untagged, so it cannot catch this by default — the bug only shows under make deploy … TAGS=<concern> on a real host (first hit live on askari, M3). See the tag-isolation pattern below to catch it in Molecule instead.
  • Check-mode artifact: a service/handler for a not-yet-installed package fails in a first-run --check; guard with when: not ansible_check_mode.
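The three bullets above, sketched as tasks. x.yml and concern are the placeholders used in the notes; the service name is hypothetical.

```yaml
# Tag selects only the include statement: under --tags concern
# nothing inside x.yml runs unless it carries its own tags.
- name: Include concern tasks (tag does not propagate)
  ansible.builtin.include_tasks: x.yml
  tags: [concern]

# apply: adds the tag to the included tasks; the outer tags: keeps
# the include statement itself selectable under --tags concern.
- name: Include concern tasks (tag propagates)
  ansible.builtin.include_tasks:
    file: x.yml
    apply:
      tags: [concern]
  tags: [concern]

# Check-mode guard for a service whose package is not installed yet.
- name: Restart the service
  ansible.builtin.service:
    name: example   # hypothetical service name
    state: restarted
  when: not ansible_check_mode
```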

Testing concern-tag isolation in Molecule

  • To catch the tag-propagation bug above in Molecule, add a second converge play that applies one concern to a fresh target — include_role with apply: {tags: [config]} — plus a verify assertion that the concern's effect landed. Drive the real partial path with molecule converge -- --tags config.
  • Sequence matters: a partial-tag run on a fresh instance fails on cross-concern deps (a config task may need a binary the packages concern installs). The realistic test is full converge → partial --tags re-run (idempotent). Harness pre_tasks (e.g. test-user creation) must be tagged always, or --tags filters them out. (Pattern proven on dev_env, 2026-06-14.)
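One possible shape for such a converge playbook, assuming the role is dev_env and the concern tag is config as in the notes. The test-user task and its name are hypothetical, and the exact play layout in the repository may differ.

```yaml
# converge.yml (sketch only)
- name: Converge
  hosts: all
  pre_tasks:
    - name: Create the test user (harness task)
      ansible.builtin.user:
        name: testuser      # hypothetical user name
      tags: [always]        # survives --tags filtering
  tasks:
    - name: Apply the role with the config concern tag
      ansible.builtin.include_role:
        name: dev_env
        apply:
          tags: [config]
      tags: [config]

# Sequence from the notes:
#   molecule converge                    # full converge first
#   molecule converge -- --tags config   # then the partial re-run
```

The verify step would then assert on something only the config concern produces, so a run that silently executes nothing fails instead of reporting changed=0.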

API / templating roles: render-only tests miss the real call

  • For a role whose payload is "render data → external API call" (e.g. public_dns → Gandi LiveDNS), apply=false Molecule + data-only pytest exercise the data file, not the rendered module args — so corrupt-template and API-rejection bugs (item.values resolving to a dict method; Gandi rejecting RFC-7505 null-MX 0 .) sail through both, plus review. Only a real (or --check) call against the API surfaces them.
  • → Treat a check-mode run against the real API as a required gate for such roles, or build a render-only assertion that materializes and inspects the rendered module args.
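A sketch of the item.values pitfall and a render-only assertion, under an assumed record shape (the real public_dns data layout is not shown in these notes). In Jinja2, dot notation tries attribute lookup first, so a key named values collides with the dict method; subscript notation reads the key.

```yaml
# Sketch only: the record structure below is an assumption.
- name: Show the two lookups side by side
  ansible.builtin.debug:
    msg:
      wrong: "{{ item.values }}"      # attribute lookup: the dict's built-in values method
      right: "{{ item['values'] }}"   # subscript lookup: the 'values' key from the data
  loop: &records
    - name: "@"
      type: MX
      values: ["10 mail.example.org."]

- name: Assert the rendered values are a list, not a method or a string
  ansible.builtin.assert:
    that:
      - item['values'] is sequence
      - item['values'] is not string
  loop: *records
```

An assertion like this catches the template bug but not an API-side rejection such as the null-MX case, which still needs the check-mode run against the real API.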

Single-file bind mount + atomic rewrite = stale config (reload-in-place only)

  • ansible.builtin.template writes atomically (temp file + rename → a new inode). A Docker single-file bind mount pins the old inode, so a container that reloads config in place (no restart) keeps reading the stale file. Live hit: reverse_proxy bind-mounted the Caddyfile as a single file; caddy reload (in-container) re-read the old inode and silently no-op'd ("config is unchanged"). The new NetBird route never loaded → Caddy never requested its cert → surfaced only as a downstream TLS handshake failure.
  • Fix for reload-in-place roles: bind-mount the config directory, not the file (./caddy/etc/caddy). Directory mounts reflect the inode swap, so the reload sees the new file (proven on askari).
  • Restart-based roles are fine with a single-file mount. Sibling case: netbird single-file-mounts config.yaml, but its handler does docker compose restart (not an in-container reload), and a restart re-resolves the bind mount (verified: route count 0 before, 1 after). Rule of thumb: reload-in-place needs a directory mount; restart-based roles don't.
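The two mount styles, sketched as a Compose excerpt. The host directory ./caddy/etc/caddy is from the notes; the service name, image tag, and container path /etc/caddy are assumptions.

```yaml
# docker-compose.yml (excerpt, sketch only)
services:
  caddy:
    image: caddy:2   # assumed image tag
    volumes:
      # Single-file mount: pins the inode that existed at container start,
      # so an atomic rewrite on the host is invisible to an in-place reload.
      # - ./caddy/etc/caddy/Caddyfile:/etc/caddy/Caddyfile:ro
      # Directory mount: the renamed file inside the directory is visible.
      - ./caddy/etc/caddy:/etc/caddy:ro
```

A role whose handler restarts the container can keep the single-file form, since the restart re-resolves the mount.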