boma/docs/superpowers/plans/2026-06-18-local-vm-integration-testing.md
sjat 65533be4d9 docs(plan): implementation plan for local VM integration testing (2.4)
20-task TDD plan: integration_test substrate role, stdlib virsh driver, askari profile, tiered certs, RED->GREEN acceptance, docker_host container-forward fix, ADR-025 + docs. Follows the 2026-06-18 design spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 11:56:04 +02:00


Local VM Integration Testing Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Give the agent a make test-integration HOST=<name> loop that boots a throwaway KVM VM on ubongo mirroring a real host, applies the real playbooks, performs a real reboot, and asserts outcomes, catching the reboot/firewall/Docker class of failures that Molecule cannot (the 2026-06-17 incident).

Architecture: A non-service integration_test role installs the libvirt/QEMU substrate on ubongo. A stdlib-only driver scripts/integration-vm.py orchestrates the lifecycle over virsh/virt-install/cloud-localds (golden Debian-13 image → ephemeral qcow2 overlay → cloud-init seed → boot → apply real playbooks via a single-host transient inventory → reboot → verify playbook → teardown). Stubs and cert-tiers are passed as Ansible -e @file extra-vars so the real inventory is never edited and the driver never parses YAML.

Tech Stack: Debian 13 (trixie), libvirt 11.3 / virt-install 5.0.0 / QEMU-KVM, cloud-init NoCloud (cloud-image-utils 0.33), Ansible, Caddy v2 (DNS-01 via the existing caddy-gandi image), Python 3 stdlib, pytest, Molecule (Docker).

Verified facts (ADR-014, 2026-06-18):

  • Image: https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2 + SHA512SUMS alongside. Ships cloud-init; no qemu-guest-agent → get IP via virsh domifaddr <dom> --source lease.
  • Seed: cloud-localds seed.img user-data [meta-data] (cloud-image-utils). Label cidata.
  • virt-install --import --disk path=...,format=qcow2 --disk path=seed.img,device=cdrom --network network=<net> --osinfo debian13 --graphics none --serial file,path=<log> --noautoconsole (package virt-install; virtinst is a transitional shim).
  • Isolated NAT net via virsh net-define/net-start/net-autostart (own bridge+subnet, <forward mode='nat'/>).
  • Caddy: acme_ca https://acme-staging-v02.api.letsencrypt.org/directory (global), tls internal (self-signed), tls { dns gandi {env.GANDI_BEARER_TOKEN} } (DNS-01; module already compiled into the boma caddy-gandi image). LE staging limits are effectively unlimited; use staging for routine cert tests.
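The virt-install flags listed above can be kept unit-testable by building the argument list in a pure function and handing it to subprocess only at the edge. A minimal sketch, assuming a hypothetical helper name `virt_install_argv` and placeholder paths (neither is part of the plan):

```python
def virt_install_argv(name, overlay, seed, network, console_log,
                      mem_mib=3072, vcpus=2):
    """Build the virt-install argument list (hypothetical helper; flags as listed above)."""
    return [
        "virt-install", "--name", name,
        "--memory", str(mem_mib), "--vcpus", str(vcpus),
        "--import",                                # boot the existing disk, no installer
        "--disk", f"path={overlay},format=qcow2",  # ephemeral overlay on the golden image
        "--disk", f"path={seed},device=cdrom",     # NoCloud seed image
        "--network", f"network={network}",
        "--osinfo", "debian13",
        "--graphics", "none",
        "--serial", f"file,path={console_log}",
        "--noautoconsole",
    ]


# Placeholder values for illustration only.
argv = virt_install_argv("boma-it-demo", "/tmp/demo.qcow2", "/tmp/demo-seed.img",
                         "boma-it", "/tmp/demo-console.log")
print(" ".join(argv))
```

Keeping the list construction separate from execution means the flag set can be asserted in pytest without libvirt being installed.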

Repo facts this plan extends:

  • roles/base/templates/nftables.conf.j2:21: chain forward { ... policy drop; }; line 26: include "{{ base__firewall_dropin_dir }}/*.nft"; base__firewall_dropin_dir: /etc/nftables.d. The drop-in include already exists; docker_host just needs to ship a .nft file.
  • base__firewall_apply gates application (roles/base/tasks/firewall.yml:32-35).
  • roles/docker_host/ installs Docker only; no container-forward rules (the green-half fix).
  • roles/reverse_proxy/templates/Caddyfile.j2: global acme_dns gandi {env.GANDI_BEARER_TOKEN} when reverse_proxy__acme_dns_provider == 'gandi'; per-site blocks; Gandi PAT via vault.gandi.pat → env.j2 GANDI_BEARER_TOKEN. No acme_ca or tls internal knob yet (this plan adds them).
  • askari: inventories/production/offsite.yml (ansible_host: 77.42.120.136, group offsite_hosts); group_vars/offsite_hosts/vars.yml (base__firewall_apply: false, base__ssh_listen_mesh_only: false); routes in group_vars/all/reverse_proxy.yml.
  • playbooks/site.yml (base→all, docker_host→docker_hosts) + playbooks/offsite.yml (docker_host→reverse_proxy→netbird_coordinator on offsite_hosts).
  • Makefile vars: VENV PLAYBOOK_BIN INVENTORY VAULT_ARGS ROLE PLAYBOOK LIMIT TAGS. pytest in tests/test_*.py (no conftest/pytest.ini; importlib-load of hyphenated scripts, see tests/test_firewall_rules.py:1-13). Tag vocabulary tests/tags.yml; scripts/check-tags.py run by make lint.
  • None of roles/integration_test/, scripts/integration-vm.py, tests/integration/ exist.

File Structure

Create:

  • roles/integration_test/ — substrate role (defaults, tasks, handlers, meta, README, molecule/default/{molecule,converge,verify}.yml). Installs libvirt/QEMU/virt-install/cloud-image-utils; enables libvirtd; adds sjat/claude to libvirt+kvm groups; creates the image cache dir.
  • scripts/integration-vm.py — stdlib-only driver. Pure helpers + impure orchestration + argparse CLI.
  • tests/test_integration_vm.py — pytest for the driver's pure helpers.
  • tests/integration/profiles/askari.json — driver-side profile metadata (groups, playbook+tags list, extra-vars files, mem/vcpu).
  • tests/integration/overrides/askari.yml — Ansible stub extra-vars (firewall on, ssh break-glass).
  • tests/integration/certs/{internal,le-staging,le-prod-wildcard}.yml — cert-tier extra-vars.
  • tests/integration/verify.yml — outcome-based verify playbook.
  • tests/integration/README.md — how the harness works.
  • docs/decisions/025-local-vm-integration-testing.md — ADR.
  • docs/runbooks/integration-testing.md — operator/agent runbook.

Modify:

  • roles/reverse_proxy/defaults/main.yml + templates/Caddyfile.j2 — add reverse_proxy__tls_internal + reverse_proxy__acme_ca knobs.
  • roles/docker_host/defaults/main.yml + tasks/main.yml + new templates/10-docker-forward.nft.j2 — the container-forward drop-in (green-half).
  • Makefile: test-integration, test-integration-clean targets.
  • .gitignore: tests/integration/.run/ (~/integration-runs/ is under $HOME, already outside the repo).
  • docs/decisions/008-testing.md, 015-control-host.md; docs/security/accepted-risks.md; CLAUDE.md; STATUS.md; docs/TODO.md; docs/hardware/reference.md — pointers/entries.

Milestones: RED (Task 15: harness reproduces the incident) → GREEN (Task 16: docker_host fix survives reboot) → le-staging cert tier (Task 17) → governance/docs (Tasks 18-20).


Phase A — Substrate role

Task 1: integration_test role (libvirt/QEMU substrate)

Files:

  • Create: roles/integration_test/{defaults,tasks,handlers,meta}/main.yml, roles/integration_test/README.md, roles/integration_test/molecule/default/{molecule,converge,verify}.yml

  • Step 1: Scaffold

Run: make new-role NAME=integration_test
Expected: Role integration_test scaffolded at roles/integration_test/

  • Step 2: defaults/main.yml
---
# integration_test — installs the local KVM/libvirt substrate on the control node
# (ubongo) so the agent can run throwaway VM integration tests (ADR-025). Non-service
# role; applied to the `control` group. Not a production hypervisor (ADR-015).
integration_test__packages:
  - qemu-system-x86      # KVM
  - qemu-utils           # qemu-img (overlays)
  - libvirt-daemon-system
  - libvirt-clients      # virsh
  - virt-install         # virt-install (trixie: the real pkg; `virtinst` is transitional)
  - cloud-image-utils    # cloud-localds (NoCloud seed)
  - genisoimage          # cloud-localds fallback
# Users granted libvirt/kvm access (run VMs without sudo).
integration_test__users:
  - sjat
  - claude
# Where the golden image + overlays live (outside the repo).
integration_test__cache_dir: "/var/lib/boma-integration"
  • Step 3: tasks/main.yml
---
- name: Install the KVM/libvirt substrate
  ansible.builtin.apt:
    name: "{{ integration_test__packages }}"
    state: present
    update_cache: true
  tags: [packages]

- name: Enable and start libvirtd
  ansible.builtin.systemd:
    name: libvirtd
    enabled: true
    state: started
  tags: [config]

- name: Grant users libvirt + kvm access
  ansible.builtin.user:
    name: "{{ item }}"
    groups: [libvirt, kvm]
    append: true
  loop: "{{ integration_test__users }}"
  tags: [users]

- name: Create the integration cache dir
  ansible.builtin.file:
    path: "{{ integration_test__cache_dir }}"
    state: directory
    owner: root
    group: libvirt
    mode: "2775"
  tags: [config]
  • Step 4: meta/main.yml (mirror roles/dev_env/meta/main.yml: author sjat, Debian/trixie, min_ansible_version: "2.17", dependencies: [], description naming ADR-025). handlers/main.yml stays --- (no handlers). README.md: purpose, that it targets the control group, links ADR-025/ADR-015.

  • Step 5: molecule/default/molecule.yml — copy roles/dev_env/molecule/default/molecule.yml verbatim (same Debian-13 systemd image).

  • Step 6: molecule/default/converge.yml

---
- name: Converge
  hosts: all
  become: true
  gather_facts: true
  roles:
    - role: integration_test
  • Step 7: molecule/default/verify.yml (assert the install results only, NOT that libvirtd is active: KVM cannot run inside the Molecule Docker container)
---
- name: Verify
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Gather package facts
      ansible.builtin.package_facts:
    - name: Assert the substrate packages are installed
      ansible.builtin.assert:
        that:
          - "'libvirt-clients' in ansible_facts.packages"
          - "'virt-install' in ansible_facts.packages"
          - "'cloud-image-utils' in ansible_facts.packages"
          - "'qemu-system-x86' in ansible_facts.packages"
    - name: Cache dir exists
      ansible.builtin.stat:
        path: /var/lib/boma-integration
      register: _cache
    - name: Assert cache dir
      ansible.builtin.assert:
        that: [_cache.stat.isdir]
  • Step 8: Add the role to the control-node play. Edit playbooks/workstation.yml (the control-node playbook that applies dev_env) to also import integration_test for control. Confirm the exact play first:

Run: grep -n "dev_env\|hosts:\|control" playbooks/workstation.yml

Then add under the same control play's roles:

    - role: integration_test
      tags: [integration_test]
  • Step 9: Lint + Molecule

Run: make lint
Expected: clean (new role-name tag integration_test auto-accepted by check-tags; concern tags packages/config/users are in tests/tags.yml).
Run: make test ROLE=integration_test
Expected: converge + idempotence + verify PASS.

  • Step 10: Commit
git add roles/integration_test playbooks/workstation.yml
git commit -m "feat(integration_test): KVM/libvirt substrate role on the control node"

Phase B — Driver: pure helpers (TDD)

Task 2: Driver skeleton + constants + CLI dispatch

Files:

  • Create: scripts/integration-vm.py

  • Test: tests/test_integration_vm.py

  • Step 1: Write the failing test (tests/test_integration_vm.py)

import importlib.util
import pathlib

_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "integration-vm.py"
_spec = importlib.util.spec_from_file_location("integration_vm", _PATH)
ivm = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(ivm)


def test_valid_tiers():
    assert ivm.VALID_TIERS == ("internal", "le-staging", "le-prod-wildcard")
  • Step 2: Run it — fails (file missing)

Run: .venv/bin/pytest tests/test_integration_vm.py -q
Expected: FAIL (cannot load scripts/integration-vm.py).

  • Step 3: Create the skeleton (scripts/integration-vm.py)
#!/usr/bin/env python3
"""boma local-VM integration test harness driver (ADR-025).

Stdlib-only by convention (TODO-14): never imports a YAML library. The transient
inventory is emitted via string templates; stubs/cert-tiers reach Ansible as
`-e @<file>` extra-vars; profile metadata is JSON. Talks to libvirt via `virsh`.
"""
import argparse
import hashlib
import json
import os
import pathlib
import re
import shutil
import subprocess
import sys
import time
import urllib.request
import uuid

REPO_ROOT = pathlib.Path(__file__).resolve().parent.parent
CACHE_DIR = pathlib.Path(os.environ.get("BOMA_IT_CACHE", "/var/lib/boma-integration"))
IMAGE_URL = "https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2"
SHA_URL = "https://cloud.debian.org/images/cloud/trixie/latest/SHA512SUMS"
IMAGE_NAME = "debian-13-genericcloud-amd64.qcow2"
NET_NAME = "boma-it"
NET_XML = """<network>
  <name>boma-it</name>
  <forward mode='nat'/>
  <bridge name='virbr-boma' stp='on' delay='0'/>
  <ip address='192.168.150.1' netmask='255.255.255.0'>
    <dhcp><range start='192.168.150.10' end='192.168.150.254'/></dhcp>
  </ip>
</network>
"""
NAME_PREFIX = "boma-it-"
RUN_DIR = REPO_ROOT / "tests" / "integration" / ".run"
DIAG_ROOT = pathlib.Path.home() / "integration-runs"
PROFILE_DIR = REPO_ROOT / "tests" / "integration" / "profiles"
INTEG_DIR = REPO_ROOT / "tests" / "integration"
CERT_DIR = REPO_ROOT / "tests" / "integration" / "certs"
DEFAULT_MEM_MIB = 3072
DEFAULT_VCPUS = 2
MIN_FREE_MIB = 4096
VALID_TIERS = ("internal", "le-staging", "le-prod-wildcard")


def main(argv=None):
    p = argparse.ArgumentParser(prog="integration-vm", description=__doc__)
    sub = p.add_subparsers(dest="cmd", required=True)
    for c in ("up", "apply", "reboot", "assert", "cycle", "down", "console"):
        sp = sub.add_parser(c)
        sp.add_argument("--host", required=True)
        sp.add_argument("--certs", choices=VALID_TIERS, default="internal")
        sp.add_argument("--keep", action="store_true")
        sp.add_argument("--no-reboot", action="store_true")
    sub.add_parser("prune")
    args = p.parse_args(argv)
    return DISPATCH[args.cmd](args)


if __name__ == "__main__":  # pragma: no cover
    sys.exit(main())

(Define DISPATCH = {...} after the command functions in later tasks; for now add a temporary DISPATCH = {} above main so import succeeds.)

  • Step 4: Run — passes

Run: .venv/bin/pytest tests/test_integration_vm.py -q
Expected: PASS.

  • Step 5: Commit
git add scripts/integration-vm.py tests/test_integration_vm.py
git commit -m "feat(integration-vm): driver skeleton + CLI dispatch"
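To illustrate how the subcommand parser behaves, the following sketch rebuilds the same argparse layout in a standalone function and parses a sample command line. The function name `build_parser` is an assumption made for the example; the skeleton above builds the parser inline in main.

```python
import argparse

VALID_TIERS = ("internal", "le-staging", "le-prod-wildcard")


def build_parser():
    """Same subcommand layout as the skeleton, factored out for illustration."""
    p = argparse.ArgumentParser(prog="integration-vm")
    sub = p.add_subparsers(dest="cmd", required=True)
    for c in ("up", "apply", "reboot", "assert", "cycle", "down", "console"):
        sp = sub.add_parser(c)
        sp.add_argument("--host", required=True)
        sp.add_argument("--certs", choices=VALID_TIERS, default="internal")
        sp.add_argument("--keep", action="store_true")
        sp.add_argument("--no-reboot", action="store_true")  # stored as args.no_reboot
    sub.add_parser("prune")
    return p


args = build_parser().parse_args(["cycle", "--host", "askari", "--certs", "le-staging"])
print(args.cmd, args.host, args.certs, args.keep, args.no_reboot)

prune_args = build_parser().parse_args(["prune"])
print(prune_args.cmd, hasattr(prune_args, "host"))
```

Note that the prune namespace carries no host attribute, so a handler dispatched for prune should not read it.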

Task 3: vm_name, free_mib, parse_lease_ip (TDD)

Files: Modify scripts/integration-vm.py, tests/test_integration_vm.py

  • Step 1: Write failing tests
def test_vm_name_prefix_and_suffix():
    assert ivm.vm_name("askari", "ab12cd34") == "boma-it-askari-ab12cd34"

def test_vm_name_generates_suffix():
    n = ivm.vm_name("askari")
    assert n.startswith("boma-it-askari-") and len(n.split("-")[-1]) == 8

def test_free_mib_parses_memavailable():
    sample = "MemTotal:       16331156 kB\nMemAvailable:    8388608 kB\n"
    assert ivm.free_mib(sample) == 8192

def test_parse_lease_ip_extracts_ipv4():
    out = (" Name       MAC address          Protocol     Address\n"
           "-------------------------------------------------------------------\n"
           " vnet0      52:54:00:aa:bb:cc    ipv4         192.168.150.42/24\n")
    assert ivm.parse_lease_ip(out) == "192.168.150.42"

def test_parse_lease_ip_none_when_absent():
    assert ivm.parse_lease_ip("no leases\n") is None
  • Step 2: Run — fail. .venv/bin/pytest tests/test_integration_vm.py -q → FAIL (no attrs).

  • Step 3: Implement (add to scripts/integration-vm.py)

def vm_name(host, suffix=None):
    suffix = suffix or uuid.uuid4().hex[:8]
    return f"{NAME_PREFIX}{host}-{suffix}"


def free_mib(meminfo_text):
    m = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(m.group(1)) // 1024 if m else 0


def parse_lease_ip(domifaddr_output):
    m = re.search(r"ipv4\s+(\d+\.\d+\.\d+\.\d+)", domifaddr_output)
    return m.group(1) if m else None
  • Step 4: Run — pass. .venv/bin/pytest tests/test_integration_vm.py -q → PASS.

  • Step 5: Commit. git commit -am "feat(integration-vm): vm naming, RAM guard, lease IP parsing"
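The two parsers above have edge cases worth seeing in isolation: a missing MemAvailable line and a lease table that also contains an IPv6 row. The sketch below copies both helpers so it runs standalone; the sample inputs are invented for illustration.

```python
import re

MIN_FREE_MIB = 4096


def free_mib(meminfo_text):
    m = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(m.group(1)) // 1024 if m else 0


def parse_lease_ip(domifaddr_output):
    m = re.search(r"ipv4\s+(\d+\.\d+\.\d+\.\d+)", domifaddr_output)
    return m.group(1) if m else None


# No MemAvailable line: the helper reports 0 MiB, so a "free < minimum" guard refuses to start.
no_field = free_mib("MemTotal:       16331156 kB\n")

# 2 GiB available is below the 4096 MiB threshold.
low = free_mib("MemAvailable:    2097152 kB\n")
blocked = low < MIN_FREE_MIB

# An IPv6 row before the IPv4 row is skipped, because the regex anchors on "ipv4".
mixed = (" vnet0  52:54:00:aa:bb:cc  ipv6  fe80::5054:ff:feaa:bbcc/64\n"
         " vnet0  52:54:00:aa:bb:cc  ipv4  192.168.150.42/24\n")
ip = parse_lease_ip(mixed)
print(no_field, low, blocked, ip)
```

Returning 0 when the field is absent makes the guard fail closed rather than start a VM on unknown memory headroom.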

Task 4: cloud-init render_meta_data / render_user_data (TDD)

Files: Modify driver + tests

  • Step 1: Write failing tests
def test_meta_data_has_instance_and_hostname():
    md = ivm.render_meta_data("iid-askari-x", "boma-it-askari-x")
    assert "instance-id: iid-askari-x" in md
    assert "local-hostname: boma-it-askari-x" in md

def test_user_data_injects_key_and_ansible_user():
    ud = ivm.render_user_data("ssh-ed25519 AAAA... claude@ubongo", "ansible")
    assert ud.startswith("#cloud-config")
    assert "name: ansible" in ud
    assert "ssh-ed25519 AAAA... claude@ubongo" in ud
    assert "NOPASSWD:ALL" in ud
  • Step 2: Run — fail.

  • Step 3: Implement

def render_meta_data(instance_id, hostname):
    return f"instance-id: {instance_id}\nlocal-hostname: {hostname}\n"


def render_user_data(ssh_pubkey, ansible_user):
    return (
        "#cloud-config\n"
        "users:\n"
        f"  - name: {ansible_user}\n"
        "    sudo: 'ALL=(ALL) NOPASSWD:ALL'\n"
        "    shell: /bin/bash\n"
        "    ssh_authorized_keys:\n"
        f"      - {ssh_pubkey}\n"
        "ssh_pwauth: false\n"
        "package_update: false\n"
    )
  • Step 4: Run — pass.

  • Step 5: Commit. git commit -am "feat(integration-vm): cloud-init user-data/meta-data rendering"
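Because the user-data is assembled with string templates rather than a YAML emitter, a value containing a newline could change the shape of the document. One possible guard is sketched below; the helper name `check_cloud_init_inputs` and the user-name pattern are assumptions for illustration, not part of the plan.

```python
import re

# Conservative login-name pattern (an assumption, chosen for the example).
_USER_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def check_cloud_init_inputs(ssh_pubkey, ansible_user):
    """Reject values that would alter the templated YAML (hypothetical helper)."""
    if not _USER_RE.fullmatch(ansible_user):
        raise ValueError(f"unsafe user name: {ansible_user!r}")
    key = ssh_pubkey.strip()  # a trailing newline from the .pub file is harmless
    if not key or "\n" in key or "\r" in key:
        raise ValueError("SSH public key must be a single non-empty line")
    return key, ansible_user


checked = check_cloud_init_inputs("ssh-ed25519 AAAA... claude@ubongo\n", "ansible")
print(checked)
```

Such a check would run before render_user_data, so the renderer itself can stay a plain string template.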

Task 5: cert_file, profile_path, render_run_hosts (TDD)

Files: Modify driver + tests

  • Step 1: Write failing tests
def test_cert_file_valid_tier():
    p = ivm.cert_file("le-staging")
    assert p.name == "le-staging.yml" and p.parent.name == "certs"

def test_cert_file_rejects_bad_tier():
    import pytest
    with pytest.raises(ValueError):
        ivm.cert_file("bogus")

def test_render_run_hosts_single_host_in_groups():
    out = ivm.render_run_hosts("boma-it-askari-x", "192.168.150.42",
                               "ansible", ["offsite_hosts"])
    assert "offsite_hosts:" in out
    assert "boma-it-askari-x:" in out
    assert "ansible_host: 192.168.150.42" in out
    assert "ansible_user: ansible" in out
    # invariant: the real askari host must NOT appear
    assert "askari:" not in out.replace("boma-it-askari-x:", "")
  • Step 2: Run — fail.

  • Step 3: Implement

def cert_file(tier):
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown cert tier: {tier}")
    return CERT_DIR / f"{tier}.yml"


def profile_path(host):
    return PROFILE_DIR / f"{host}.json"


def render_run_hosts(name, ip, ansible_user, groups):
    lines = [
        "# Generated by scripts/integration-vm.py — transient, gitignored. Do not edit.",
        "# Single test host ONLY (safety invariant: no real host is ever in scope).",
        "all:",
        "  children:",
    ]
    for g in groups:
        lines += [
            f"    {g}:",
            "      hosts:",
            f"        {name}:",
            f"          ansible_host: {ip}",
            f"          ansible_user: {ansible_user}",
        ]
    return "\n".join(lines) + "\n"
  • Step 4: Run — pass.

  • Step 5: Commit. git commit -am "feat(integration-vm): cert-tier + profile + transient inventory rendering"
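The single-host safety invariant can also be checked on the rendered text itself, independently of the renderer. The sketch below assumes the exact indentation that render_run_hosts emits (host keys at eight spaces); the helper names `hosts_in_inventory` and `assert_only_test_host` are hypothetical.

```python
import re

NAME_PREFIX = "boma-it-"


def hosts_in_inventory(text):
    """Return host names from the generated layout (keys indented by exactly 8 spaces)."""
    return re.findall(r"^ {8}(\S+):\s*$", text, re.MULTILINE)


def assert_only_test_host(text, name):
    """Raise unless the inventory contains exactly the one prefixed test host."""
    hosts = set(hosts_in_inventory(text))
    if hosts != {name} or not name.startswith(NAME_PREFIX):
        raise ValueError(f"unexpected hosts in transient inventory: {sorted(hosts)}")


sample = (
    "all:\n"
    "  children:\n"
    "    offsite_hosts:\n"
    "      hosts:\n"
    "        boma-it-askari-x:\n"
    "          ansible_host: 192.168.150.42\n"
    "          ansible_user: ansible\n"
)
assert_only_test_host(sample, "boma-it-askari-x")
found = hosts_in_inventory(sample)
print(found)
```

This is deliberately tied to the generated layout; it is not a general YAML parser and would need adjusting if the template's indentation changed.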


Phase C — Driver: orchestration (impure)

Task 6: sh helper + ensure_image

Files: Modify driver

  • Step 1: Implement the subprocess helper + image fetch
def sh(cmd, check=True, capture=False, **kw):
    """Run a command (list form). Logs the command to stderr."""
    print("+ " + " ".join(str(c) for c in cmd), file=sys.stderr)
    return subprocess.run(cmd, check=check,
                          capture_output=capture, text=True, **kw)


def _expected_sha(sha_text, filename):
    for line in sha_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0]
    return None


def ensure_image():
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    img = CACHE_DIR / IMAGE_NAME
    if img.exists():
        return img
    print(f"Downloading {IMAGE_URL} ...", file=sys.stderr)
    tmp = img.with_suffix(".part")
    urllib.request.urlretrieve(IMAGE_URL, tmp)
    with urllib.request.urlopen(SHA_URL) as resp:
        sha_text = resp.read().decode()
    want = _expected_sha(sha_text, IMAGE_NAME)
    if not want:
        tmp.unlink(missing_ok=True)
        raise SystemExit(f"checksum for {IMAGE_NAME} not found at {SHA_URL}")
    h = hashlib.sha512()
    with open(tmp, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != want:
        tmp.unlink(missing_ok=True)
        raise SystemExit("golden image SHA512 mismatch — refusing to use it")
    tmp.rename(img)
    return img
  • Step 2: Manual verification

Run: .venv/bin/python scripts/integration-vm.py prune once Task 10 has added prune; until then, test ensure_image directly:

.venv/bin/python -c "import importlib.util,pathlib; \
s=importlib.util.spec_from_file_location('ivm','scripts/integration-vm.py'); \
m=importlib.util.module_from_spec(s); s.loader.exec_module(m); print(m.ensure_image())"

Expected: downloads to /var/lib/boma-integration/debian-13-genericcloud-amd64.qcow2, SHA512 verified, prints the path. (Requires Task 1's role applied so the cache dir is group-writable, or run with sudo once.)

  • Step 3: Commit. git commit -am "feat(integration-vm): golden image fetch + SHA512 verification"
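The checksum lookup and the streaming hash can be exercised without any network access. The sketch below copies the lookup logic and runs it against a throwaway file and an invented SHA512SUMS text; `sha512_of` is a hypothetical name for the hashing loop that ensure_image performs inline.

```python
import hashlib
import pathlib
import tempfile


def expected_sha(sha_text, filename):
    """Find the digest for filename in '<digest>  <name>' formatted text."""
    for line in sha_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0]
    return None


def sha512_of(path, chunk=1 << 20):
    """Hash a file in 1 MiB chunks so large images never sit fully in memory."""
    h = hashlib.sha512()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    img = pathlib.Path(tmp) / "demo.qcow2"
    img.write_bytes(b"not a real image")
    digest = sha512_of(img)
    sums = f"{'0' * 128}  other.qcow2\n{digest}  demo.qcow2\n"
    verified = expected_sha(sums, "demo.qcow2") == digest
    missing = expected_sha(sums, "absent.qcow2")

print(verified, missing)
```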

Task 7: net_ensure, up (boot a VM)

Files: Modify driver

  • Step 1: Implement
def net_ensure():
    r = sh(["virsh", "net-info", NET_NAME], check=False, capture=True)
    if r.returncode != 0:
        xml = RUN_DIR / "net.xml"
        RUN_DIR.mkdir(parents=True, exist_ok=True)
        xml.write_text(NET_XML)
        sh(["virsh", "net-define", str(xml)])
        sh(["virsh", "net-autostart", NET_NAME])
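    active = sh(["virsh", "net-info", NET_NAME], capture=True).stdout
    # Match on the field itself rather than on virsh's exact column padding.
    if not re.search(r"^Active:\s+yes\s*$", active, re.MULTILINE):
        sh(["virsh", "net-start", NET_NAME])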


def _ssh_pubkey():
    for cand in ("id_ed25519.pub", "id_rsa.pub"):
        p = pathlib.Path.home() / ".ssh" / cand
        if p.exists():
            return p.read_text().strip()
    raise SystemExit("no SSH public key found in ~/.ssh")


def up(host, name=None, mem_mib=DEFAULT_MEM_MIB, vcpus=DEFAULT_VCPUS):
    free = free_mib(pathlib.Path("/proc/meminfo").read_text())
    if free < MIN_FREE_MIB:
        raise SystemExit(f"refusing to start: only {free} MiB free (< {MIN_FREE_MIB})")
    running = sh(["virsh", "list", "--name"], capture=True).stdout.split()
    if any(n.startswith(NAME_PREFIX) for n in running):
        raise SystemExit("an integration VM is already running (one at a time); "
                         "run `integration-vm prune` first")
    name = name or vm_name(host)
    img = ensure_image()
    net_ensure()
    RUN_DIR.mkdir(parents=True, exist_ok=True)
    overlay = RUN_DIR / f"{name}.qcow2"
    sh(["qemu-img", "create", "-f", "qcow2", "-F", "qcow2", "-b", str(img), str(overlay)])
    (RUN_DIR / "user-data").write_text(render_user_data(_ssh_pubkey(), "ansible"))
    (RUN_DIR / "meta-data").write_text(render_meta_data(f"iid-{name}", name))
    seed = RUN_DIR / f"{name}-seed.img"
    sh(["cloud-localds", str(seed), str(RUN_DIR / "user-data"), str(RUN_DIR / "meta-data")])
    DIAG_ROOT.mkdir(parents=True, exist_ok=True)
    console = DIAG_ROOT / f"{name}-console.log"
    sh(["virt-install", "--name", name, "--memory", str(mem_mib), "--vcpus", str(vcpus),
        "--import",
        "--disk", f"path={overlay},format=qcow2",
        "--disk", f"path={seed},device=cdrom",
        "--network", f"network={NET_NAME}",
        "--osinfo", "debian13",
        "--graphics", "none",
        "--serial", f"file,path={console}",
        "--noautoconsole"])
    ip = wait_for_ip(name)
    wait_for_ssh(ip, "ansible")
    (RUN_DIR / "current").write_text(f"{name}\n{ip}\n{host}\n")
    print(f"VM {name} up at {ip}")
    return name, ip


def wait_for_ip(name, timeout=120):
    end = time.time() + timeout
    while time.time() < end:
        out = sh(["virsh", "domifaddr", name, "--source", "lease"],
                 check=False, capture=True).stdout
        ip = parse_lease_ip(out)
        if ip:
            return ip
        time.sleep(4)
    raise SystemExit(f"timed out waiting for {name} to get a DHCP lease")


def wait_for_ssh(ip, user, timeout=180):
    end = time.time() + timeout
    while time.time() < end:
        r = sh(["ssh", "-o", "StrictHostKeyChecking=no",
                "-o", "UserKnownHostsFile=/dev/null", "-o", "ConnectTimeout=5",
                f"{user}@{ip}", "true"], check=False, capture=True)
        if r.returncode == 0:
            return
        time.sleep(5)
    raise SystemExit(f"timed out waiting for SSH to {ip}")
  • Step 2: Manual smoke (real KVM — requires Task 1 applied to ubongo)
.venv/bin/python scripts/integration-vm.py up --host askari   # via DISPATCH once Task 10 lands

Expected: golden image present, boma-it net active, overlay + seed created, VM boots, prints VM boma-it-askari-<id> up at 192.168.150.x. SSH in: ssh ansible@<ip> works.

  • Step 3: Commit. git commit -am "feat(integration-vm): network + VM boot (overlay, cloud-init seed, virt-install import)"
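wait_for_ip and wait_for_ssh share the same poll-until-deadline shape. A generic version is sketched below under the assumption that a helper like `wait_until` (a hypothetical name) would be acceptable in the driver; it uses time.monotonic, which is not affected by wall-clock adjustments, and accepts injectable clock and sleep functions so it can be tested without real delays.

```python
import time


def wait_until(probe, timeout, interval, clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns a truthy value or the deadline passes."""
    deadline = clock() + timeout
    while True:
        result = probe()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"gave up after {timeout}s")
        sleep(interval)


attempts = []


def fake_probe():
    """Stand-in for a lease lookup: succeeds on the third call."""
    attempts.append(1)
    return "192.168.150.42" if len(attempts) >= 3 else None


ip = wait_until(fake_probe, timeout=5, interval=0)
print(ip, len(attempts))
```

The probe always runs at least once, so a zero timeout still performs a single check before giving up.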

Task 8: write_run_inventory, apply

Files: Modify driver

  • Step 1: Implement
def _read_current():
    txt = (RUN_DIR / "current").read_text().splitlines()
    return txt[0], txt[1], txt[2]   # name, ip, host


def write_run_inventory(name, ip, groups):
    RUN_DIR.mkdir(parents=True, exist_ok=True)
    (RUN_DIR / "hosts.yml").write_text(
        render_run_hosts(name, ip, "ansible", groups))
    link = RUN_DIR / "group_vars"
    target = REPO_ROOT / "inventories" / "production" / "group_vars"
    # Replace a stale symlink; leave a real directory (if any) untouched.
    if link.is_symlink():
        link.unlink()
    if not link.exists():
        link.symlink_to(target)


def apply(host, certs):
    name, ip, _ = _read_current()
    prof = json.loads(profile_path(host).read_text())
    write_run_inventory(name, ip, prof["groups"])
    extra = []
    for f in prof.get("extra_vars_files", []):
        extra += ["-e", f"@{INTEG_DIR / f}"]
    extra += ["-e", f"@{cert_file(certs)}"]
    for step in prof["applies"]:
        cmd = [".venv/bin/ansible-playbook", "-i", str(RUN_DIR) + "/",
               f"playbooks/{step['playbook']}", "--limit", name]
        if step.get("tags"):
            cmd += ["--tags", ",".join(step["tags"])]
        cmd += extra
        sh(cmd, cwd=str(REPO_ROOT))
    print(f"applied {host} profile to {name}")
  • Step 2: Manual verification — deferred to the Task 15 RED run (needs the profile/overlay/cert files from Phase D). Lint passes regardless.

  • Step 3: Commit. git commit -am "feat(integration-vm): transient inventory + real-playbook apply"
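The command that apply assembles for each profile step can be previewed without running Ansible. The sketch below factors the list-building into a hypothetical `playbook_argv` function and feeds it a sample profile; the VM name and the inline profile are illustrative values only.

```python
import json

# Sample profile for illustration; the real one lives under tests/integration/profiles/.
PROFILE_JSON = """
{
  "groups": ["offsite_hosts"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]},
    {"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
  ],
  "extra_vars_files": ["overrides/askari.yml"]
}
"""


def playbook_argv(step, name, inventory_dir, extra_files):
    """Build one ansible-playbook command line for a profile step (hypothetical helper)."""
    cmd = [".venv/bin/ansible-playbook", "-i", inventory_dir,
           f"playbooks/{step['playbook']}", "--limit", name]
    if step.get("tags"):
        cmd += ["--tags", ",".join(step["tags"])]
    for f in extra_files:
        cmd += ["-e", f"@{f}"]  # stubs and cert tier arrive as extra-vars files
    return cmd


profile = json.loads(PROFILE_JSON)
extra = [f"tests/integration/{f}" for f in profile["extra_vars_files"]]
extra.append("tests/integration/certs/internal.yml")
commands = [playbook_argv(s, "boma-it-askari-x", "tests/integration/.run/", extra)
            for s in profile["applies"]]
for c in commands:
    print(" ".join(c))
```

Every command carries --limit with the transient VM name, which is the second line of defence behind the single-host inventory.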

Task 9: reboot_vm, run_assert, dump_diagnostics

Files: Modify driver

  • Step 1: Implement
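def _boot_id(ip):
    """Return the guest kernel's boot_id, or None while SSH is unreachable."""
    r = sh(["ssh", "-o", "StrictHostKeyChecking=no",
            "-o", "UserKnownHostsFile=/dev/null", "-o", "ConnectTimeout=5",
            f"ansible@{ip}", "cat /proc/sys/kernel/random/boot_id"],
           check=False, capture=True)
    return r.stdout.strip() if r.returncode == 0 else None


def reboot_vm(timeout=180):
    name, ip, _ = _read_current()
    before = _boot_id(ip)
    sh(["virsh", "reboot", name])
    # sshd may keep answering for a few seconds after the reboot request, so wait
    # for a new boot_id instead of the first successful SSH connection.
    end = time.time() + timeout
    while time.time() < end:
        time.sleep(5)
        after = _boot_id(ip)
        if after and after != before:
            print(f"{name} rebooted, SSH back at {ip}")
            return
    raise SystemExit(f"timed out waiting for {name} to come back after reboot")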


def run_assert(host, certs):
    name, ip, _ = _read_current()
    prof = json.loads(profile_path(host).read_text())
    write_run_inventory(name, ip, prof["groups"])
    extra = []
    for f in prof.get("extra_vars_files", []):
        extra += ["-e", f"@{INTEG_DIR / f}"]
    extra += ["-e", f"@{cert_file(certs)}"]
    cmd = [".venv/bin/ansible-playbook", "-i", str(RUN_DIR) + "/",
           "tests/integration/verify.yml", "--limit", name] + extra
    r = sh(cmd, cwd=str(REPO_ROOT), check=False)
    if r.returncode != 0:
        dump_diagnostics(name, ip)
        raise SystemExit(f"VERIFY FAILED for {name} — diagnostics in {DIAG_ROOT}")
    print(f"VERIFY PASSED for {name}")


def dump_diagnostics(name, ip):
    d = DIAG_ROOT / name
    d.mkdir(parents=True, exist_ok=True)
    for label, cmd in [
        ("nft", "nft list ruleset"),
        ("docker", "docker ps -a"),
        ("ss", "ss -tlnp"),
        ("journal", "journalctl -b --no-pager"),
        ("critical-chain", "systemd-analyze critical-chain"),
    ]:
        r = sh(["ssh", "-o", "StrictHostKeyChecking=no",
                "-o", "UserKnownHostsFile=/dev/null",
                f"ansible@{ip}", "sudo " + cmd], check=False, capture=True)
        (d / f"{label}.txt").write_text((r.stdout or "") + (r.stderr or ""))
    console = DIAG_ROOT / f"{name}-console.log"
    if console.exists():
        shutil.copy(console, d / "console.log")
    print(f"diagnostics written to {d}", file=sys.stderr)
  • Step 2: Commit. git commit -am "feat(integration-vm): reboot, verify run, failure diagnostics"

Task 10: down, prune, console, cycle + DISPATCH

Files: Modify driver

  • Step 1: Implement
def _destroy(name):
    sh(["virsh", "destroy", name], check=False)
    sh(["virsh", "undefine", name, "--nvram"], check=False)
    for f in RUN_DIR.glob(f"{name}*"):
        f.unlink(missing_ok=True)


def down(host=None, keep=False):
    if keep:
        print("--keep: leaving the VM running for inspection")
        return
    cur = RUN_DIR / "current"
    if cur.exists():
        name = cur.read_text().splitlines()[0]
        _destroy(name)
        cur.unlink(missing_ok=True)
        print(f"destroyed {name}")


def prune():
    running = sh(["virsh", "list", "--all", "--name"], capture=True).stdout.split()
    for n in running:
        if n.startswith(NAME_PREFIX):
            _destroy(n)
            print(f"pruned {n}")
    (RUN_DIR / "current").unlink(missing_ok=True)


def console():
    name = (RUN_DIR / "current").read_text().splitlines()[0]
    log = DIAG_ROOT / f"{name}-console.log"
    print(log.read_text() if log.exists() else f"no console log at {log}")


def cycle(host, certs, keep=False, no_reboot=False):
    try:
        up(host)
        apply(host, certs)
        if not no_reboot:
            reboot_vm()
        run_assert(host, certs)
    finally:
        # Runs on success and on failure alike: destroys the VM unless --keep is set.
        if not keep:
            down(host)

Wire the dispatch (replace the temporary DISPATCH = {}):

DISPATCH = {
    "up": lambda a: (up(a.host), None)[1],
    "apply": lambda a: apply(a.host, a.certs),
    "reboot": lambda a: reboot_vm(),
    "assert": lambda a: run_assert(a.host, a.certs),
    "down": lambda a: down(a.host, a.keep),
    "console": lambda a: console(),
    "prune": lambda a: prune(),
    "cycle": lambda a: cycle(a.host, a.certs, a.keep, a.no_reboot),
}

Fix cycle's teardown semantics: on failure keep the VM (so it can be inspected); on success destroy. Implement by catching success explicitly:

def cycle(host, certs, keep=False, no_reboot=False):
    ok = False
    try:
        up(host)
        apply(host, certs)
        if not no_reboot:
            reboot_vm()
        run_assert(host, certs)
        ok = True
    finally:
        if ok and not keep:
            down(host)
        elif not ok:
            print("FAILED — VM left up for inspection; `integration-vm prune` to clean.",
                  file=sys.stderr)
  • Step 2: Run unit tests + lint. .venv/bin/pytest tests/test_integration_vm.py -q → PASS; make lint → clean.

  • Step 3: Commit. git commit -am "feat(integration-vm): teardown, prune, console, full cycle + dispatch"
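The keep-on-failure teardown rule can be demonstrated with stub steps in place of real VM operations. In the sketch below `run_cycle` is a hypothetical generalisation of cycle; it relies on the fact that a finally block still runs when SystemExit propagates.

```python
def run_cycle(steps, teardown, keep=False):
    """Run steps in order; tear down only after a fully successful run."""
    ok = False
    try:
        for step in steps:
            step()
        ok = True
    finally:
        if ok and not keep:
            teardown()
    return ok


events = []


def boom():
    """Stand-in for a failing verify run."""
    raise SystemExit("VERIFY FAILED")


# Successful run: teardown fires.
run_cycle([lambda: events.append("up")], lambda: events.append("down"))

# Failing run: teardown is skipped and the error still propagates.
try:
    run_cycle([lambda: events.append("up"), boom], lambda: events.append("down"))
except SystemExit:
    events.append("left-up")

print(events)
```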


Phase D — Profile, cert internal tier, verify playbook

Task 11: reverse_proxy tls internal + acme_ca knobs

Files: Modify roles/reverse_proxy/defaults/main.yml, roles/reverse_proxy/templates/Caddyfile.j2

  • Step 1: defaults — append:
# Integration-test / staging cert knobs (ADR-025). Default off = production behaviour.
reverse_proxy__tls_internal: false   # true => every site uses Caddy's self-signed CA
reverse_proxy__acme_ca: ""           # set to the LE staging directory URL to use staging
  • Step 2: Caddyfile.j2 — in the global options block (after the email line), add:
{% if reverse_proxy__acme_ca %}
  acme_ca {{ reverse_proxy__acme_ca }}
{% endif %}

In each site block (inside {{ r['host'] }} {), add as the first directive:

{% if reverse_proxy__tls_internal %}
  tls internal
{% endif %}
  • Step 3: Molecule regression — confirm reverse_proxy still renders. If the role has a Molecule scenario, run make test ROLE=reverse_proxy; else make lint. Expected: clean; default-off means production output is byte-identical (the {% if %} blocks emit nothing).

  • Step 4: Commit. git commit -am "feat(reverse_proxy): tls-internal + acme_ca knobs for integration/staging (ADR-025)"

Task 12: askari profile + overlay + cert-tier files

Files: Create tests/integration/profiles/askari.json, tests/integration/overrides/askari.yml, tests/integration/certs/{internal,le-staging,le-prod-wildcard}.yml

  • Step 1: profiles/askari.json
{
  "groups": ["offsite_hosts"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]},
    {"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
  ],
  "extra_vars_files": ["overrides/askari.yml"],
  "mem_mib": 3072,
  "vcpus": 2
}

(netbird_coordinator is intentionally omitted from v1 applies — Caddy's published :443 gives the DNAT that reproduces FRICTION #1. Coordinator fidelity (#3/#4) is a follow-on, Task 21.)

  • Step 2: overrides/askari.yml (Ansible extra-vars; highest precedence — never edits real inventory)
---
# Integration-test overlay for the "askari" profile (ADR-025). Passed via `-e @`.
# Reproduces the 2026-06-17 incident: apply base's nftables default-deny to a Docker host.
base__firewall_apply: true
# Keep a break-glass: sshd stays on all interfaces (never wt0-only in a throwaway VM).
base__ssh_listen_mesh_only: false
# The VM is isolated; it must never touch the real mesh.
base__mesh_enabled: false
  • Step 3: cert-tier files

certs/internal.yml:

---
reverse_proxy__tls_internal: true

certs/le-staging.yml:

---
reverse_proxy__tls_internal: false
reverse_proxy__acme_dns_provider: gandi
reverse_proxy__acme_ca: "https://acme-staging-v02.api.letsencrypt.org/directory"

certs/le-prod-wildcard.yml:

---
# On-demand only. Records an accepted risk (ADR-025 / accepted-risks.md): the prod
# Gandi PAT reaches an ephemeral VM and transient TXT records land in the real wingu.me.
reverse_proxy__tls_internal: false
reverse_proxy__acme_dns_provider: gandi
reverse_proxy__acme_ca: ""
  • Step 4: Commit. git commit -am "feat(integration): askari profile, stub overlay, cert-tier files"
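To make the relationship between the profile, the overlay, and the cert-tier file concrete, here is a minimal sketch of how a driver could turn a profile into ansible-playbook invocations. It is an illustration under assumptions, not the driver's actual apply: `build_apply_commands`, the `base_dir` default, and the inventory path are hypothetical, and playbook paths are passed through unresolved.

```python
import json

# The askari profile from Step 1, inlined so the sketch is self-contained.
PROFILE = json.loads("""
{
  "groups": ["offsite_hosts"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]},
    {"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
  ],
  "extra_vars_files": ["overrides/askari.yml"],
  "mem_mib": 3072,
  "vcpus": 2
}
""")


def build_apply_commands(profile, inventory, certs_file, base_dir="tests/integration"):
    """Return one ansible-playbook argv list per entry in profile["applies"].

    Hypothetical helper: overlay files come first and the cert-tier file last,
    because with repeated -e options the later one wins on a conflicting key.
    """
    extra_files = [f"{base_dir}/{rel}" for rel in profile["extra_vars_files"]]
    extra_files.append(certs_file)

    commands = []
    for step in profile["applies"]:
        argv = ["ansible-playbook", "-i", inventory, step["playbook"]]
        if step.get("tags"):
            argv += ["--tags", ",".join(step["tags"])]
        for path in extra_files:
            argv += ["-e", f"@{path}"]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    for argv in build_apply_commands(
        PROFILE, "tests/integration/.run/hosts", "tests/integration/certs/internal.yml"
    ):
        print(" ".join(argv))
```

Because everything arrives as `-e @file`, the driver only ever handles file paths and never has to parse the YAML itself.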

Task 13: verify playbook

Files: Create tests/integration/verify.yml

  • Step 1: Write it
---
# Integration verify (ADR-025). Outcome-based: proves Docker forwarding survives the
# reboot. The load-bearing check probes the VM's published :443 FROM the controller
# (ubongo) — if base's forward-drop killed DNAT, this times out (the FRICTION #1 bug).
- name: Verify the rebooted host
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Docker daemon is active
      ansible.builtin.command: systemctl is-active docker
      changed_when: false

    - name: Forward chain permits container traffic (drop-in loaded)
      ansible.builtin.command: nft list chain inet filter forward
      register: _fwd
      changed_when: false

    - name: Assert container forwarding is allowed (not pure drop)
      ansible.builtin.assert:
        that: "'accept' in _fwd.stdout"
        fail_msg: >-
          forward chain is pure drop — container forwarding will die on reboot
          (FRICTION 2026-06-17 #1). docker_host container-forward drop-in missing.

    - name: Published HTTPS port answers from the controller (DNAT + forward alive)
      delegate_to: localhost
      become: false
      ansible.builtin.uri:
        url: "https://{{ ansible_host }}/"
        validate_certs: false
        status_code: [200, 308, 404, 502, 503]
        timeout: 10
      register: _probe
      retries: 5
      delay: 6
      until: _probe is succeeded
  • Step 2: Lint. make lint — clean (file is under tests/, not playbooks/, but keep tags valid; this play uses none, which is fine).

  • Step 3: Commit. git commit -am "feat(integration): outcome-based verify playbook (DNAT-survives-reboot)"
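When reading diagnostics by hand, a slightly stricter check than the substring test in the assert task can help separate a real accept rule from incidental text. The following is a sketch only: `forward_accept_rules` is a hypothetical helper, and the sample listings are illustrative rather than captured from a run.

```python
def forward_accept_rules(nft_chain_output):
    """Return the rule lines of a chain listing that carry an `accept` verdict.

    Structure lines and the hook/policy declaration are skipped, so a chain
    containing only `policy drop;` yields an empty list.
    """
    rules = []
    for raw in nft_chain_output.splitlines():
        line = raw.strip()
        if not line or line.endswith("{") or line == "}" or line.startswith("type "):
            continue
        # Token match rather than substring match; a quoted comment containing
        # the bare word "accept" would still be counted (accepted limitation).
        if "accept" in line.split():
            rules.append(line)
    return rules


PURE_DROP = """
table inet filter {
    chain forward {
        type filter hook forward priority filter; policy drop;
    }
}
"""

WITH_DROPIN = """
table inet filter {
    chain forward {
        type filter hook forward priority filter; policy drop;
        ct state established,related accept
        iifname "docker0" accept
    }
}
"""

if __name__ == "__main__":
    print(forward_accept_rules(PURE_DROP))
    print(forward_accept_rules(WITH_DROPIN))
```

The :443 probe from the controller remains the load-bearing check; this only makes the rule listing easier to reason about.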


Phase E — Makefile + RED milestone

Task 14: Makefile targets + .gitignore

Files: Modify Makefile, .gitignore

  • Step 1: Makefile — add after the test-all target:
test-integration:
ifndef HOST
	$(error HOST is required: make test-integration HOST=<name> [CERTS=internal|le-staging] [KEEP=1])
endif
	PATH="$(CURDIR)/$(VENV)/bin:$$PATH" $(PYTHON) scripts/integration-vm.py cycle \
	  --host $(HOST) $(if $(CERTS),--certs $(CERTS)) $(if $(KEEP),--keep)

test-integration-clean:
	PATH="$(CURDIR)/$(VENV)/bin:$$PATH" $(PYTHON) scripts/integration-vm.py prune

Add both to .PHONY and the help block (match the existing style).

  • Step 2: .gitignore — add:
# Integration-test transient run dir (ADR-025); diagnostics live under ~/integration-runs
tests/integration/.run/
  • Step 3: Commit. git commit -am "feat(make): test-integration / test-integration-clean targets"
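For orientation, the flags the Makefile passes could map onto an argparse interface roughly like the one below. This is a sketch under assumptions, not the driver's actual parser: `build_parser` is a hypothetical name, only the cycle and prune subcommands are shown, and the default tier of internal follows the plan's "internal default" wording.

```python
import argparse

# Cert tiers named in the plan; internal is treated as the default here.
CERT_TIERS = ("internal", "le-staging", "le-prod-wildcard")


def build_parser():
    """Hypothetical parser covering the two subcommands the Makefile calls."""
    parser = argparse.ArgumentParser(prog="integration-vm")
    sub = parser.add_subparsers(dest="command", required=True)

    cycle = sub.add_parser("cycle", help="up, apply, reboot, assert, down")
    cycle.add_argument("--host", required=True)
    cycle.add_argument("--certs", choices=CERT_TIERS, default="internal")
    cycle.add_argument("--keep", action="store_true",
                       help="leave the VM up after a passing run")

    sub.add_parser("prune", help="remove leftover integration VMs")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["cycle", "--host", "askari", "--keep"])
    print(args.command, args.host, args.certs, args.keep)
```

One Make detail worth knowing: `$(if $(KEEP),--keep)` tests only for a non-empty value, so KEEP=0 also passes --keep.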

Task 15: RED milestone — reproduce the incident

Files: none (a validation run); record the outcome.

  • Step 1: Pre-flight — confirm rbw unlocked (the apply decrypts group_vars/all/vault.yml); confirm Task 1's role is applied to ubongo (virsh version works, you're in the libvirt group — may need a re-login).

  • Step 2: Run the cycle on TODAY's base (no docker_host fix yet)

Run: make test-integration HOST=askari

Expected: VM boots → base (firewall on) + docker_host + reverse_proxy apply → reboot → verify FAILS at "Assert container forwarding is allowed" and/or the :443 probe times out. Diagnostics appear under ~/integration-runs/boma-it-askari-<id>/ (nft shows forward { policy drop } with no accepts; the published port is dead).

  • Step 3: Confirm the failure is the RIGHT one — read ~/integration-runs/<name>/nft.txt: the inet filter forward chain is pure policy drop. This is the faithful reproduction of FRICTION #1. If verify PASSES here, the harness is not faithful — stop and investigate (e.g. Docker re-added its own accepts, or the firewall didn't apply).

  • Step 4: Clean up. make test-integration-clean

  • Step 5: Record — append a [gotcha]/milestone note to docs/FRICTION.md Open signals: "ADR-025 harness reproduced the 2026-06-17 firewall×Docker×reboot bug on a local VM (RED). Diagnostics: nft forward pure-drop, :443 DNAT dead post-reboot." Commit:

git commit -am "test(integration): RED — harness reproduces the 2026-06-17 incident"

Phase F — GREEN milestone (docker_host fix)

Task 16: docker_host container-forward drop-in

Files: Modify roles/docker_host/defaults/main.yml, roles/docker_host/tasks/main.yml; Create roles/docker_host/templates/10-docker-forward.nft.j2

  • Step 1: defaults — append:
# Container-forward nftables drop-in (FRICTION 2026-06-17 #1 / ADR-025). base's inet
# filter forward chain is `policy drop`; a drop verdict there is final, so Docker's own
# ip-filter accepts can't save forwarded container traffic. We append accepts to base's
# forward chain via base's /etc/nftables.d/*.nft include. Only meaningful on hosts where
# base__firewall_apply is true.
docker_host__forward_dropin: true
  • Step 2: template templates/10-docker-forward.nft.j2
# {{ ansible_managed }}
# Allow container forwarding through base's default-deny forward chain (ADR-025).
table inet filter {
  chain forward {
    ct state established,related accept
    iifname "docker0" accept
    oifname "docker0" accept
    # nftables uses a trailing * for interface-name prefixes (iptables uses +).
    iifname "br-*" accept
    oifname "br-*" accept
  }
}
  • Step 3: tasks/main.yml — append (after Docker install):
- name: Install the container-forward nftables drop-in
  ansible.builtin.template:
    src: 10-docker-forward.nft.j2
    dest: "{{ base__firewall_dropin_dir }}/10-docker-forward.nft"
    mode: "0644"
  when: docker_host__forward_dropin | bool
  notify: reload nftables
  tags: [firewall]

Confirm the handler name base exposes:

Run: grep -rn "listen:\|reload nftables\|nftables" roles/base/handlers/main.yml

Use base's actual handler listen: topic; if none fits, add a docker_host handler that runs nft -f /etc/nftables.conf (the same reload base uses). Show the handler you add in roles/docker_host/handlers/main.yml:

---
- name: reload nftables
  ansible.builtin.command: nft -f /etc/nftables.conf
  listen: reload nftables
  • Step 4: GREEN run

Run: make test-integration HOST=askari

Expected: apply (now includes the drop-in) → reboot → verify PASSES (forward chain has accept rules; :443 answers from ubongo). This is the red→green proof.

If it still fails, read the diagnostics and iterate on the .nft rules from Step 2 (e.g. Docker's compose bridges, or a NAT/masquerade gap); this is exactly what the harness is for. Keep editing the Step 2 template and re-running this step until verify passes.

  • Step 5: Idempotence + lint + Molecule. make lint; make test ROLE=docker_host (add a Molecule assertion that the drop-in file renders if the role has a scenario).

  • Step 6: Commit. git commit -am "fix(docker_host): container-forward nftables drop-in survives reboot (FRICTION #1, ADR-025)"


Phase G — le-staging cert tier

Task 17: validate --certs le-staging

Files: none new (exercises Task 11/12); may tweak overrides/askari.yml if DNS-01 names need adjusting.

  • Step 1: Pre-flight: rbw unlocked (the run needs vault.gandi.pat for DNS-01). The VM needs outbound egress (the boma-it NAT net provides it).

  • Step 2: Run with the staging cert tier

Run: make test-integration HOST=askari CERTS=le-staging

Expected: same apply, but Caddy now uses DNS-01 against LE staging (untrusted root) for the profile's route hostnames (under wingu.me, whose DNS lives at Gandi). Verify still passes (the :443 probe uses validate_certs: false).

  • Step 3: Confirm a real staging cert issued: run make test-integration HOST=askari CERTS=le-staging KEEP=1, then:
NAME=$(.venv/bin/python -c "print(open('tests/integration/.run/current').read().split()[0])")
IP=$(sed -n 2p tests/integration/.run/current)
ssh ansible@$IP "sudo docker exec caddy ls /data/caddy/certificates"   # adjust to the caddy data path

Expected: a cert dir under an acme-staging-v02... issuer path (proves the DNS-01 staging path works end to end). Then make test-integration-clean.

  • Step 4: Commit (only if overrides/certs needed tweaks): git commit -am "test(integration): validate le-staging DNS-01 cert path"
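The manual check in Step 3 can be turned into a small filter over the directory listing. This is a sketch under assumptions: `staging_issuer_dirs` is a hypothetical helper, the sample listing is illustrative rather than captured output, and the exact issuer directory name Caddy writes should be confirmed on the kept VM.

```python
def staging_issuer_dirs(listing):
    """Return entries of a certificates/ listing that look like LE-staging issuers.

    `listing` is the text printed by `ls` on the certificates directory,
    one entry per line.
    """
    names = (line.strip().rstrip("/") for line in listing.splitlines())
    return [name for name in names if name.startswith("acme-staging-v02")]


# Illustrative listing only; the real directory names may differ.
SAMPLE_LISTING = "acme-staging-v02.api.letsencrypt.org-directory\nlocal\n"

if __name__ == "__main__":
    found = staging_issuer_dirs(SAMPLE_LISTING)
    print("staging issuer present" if found else "no staging issuer dir found")
```

An empty result on a le-staging run suggests issuance never completed, which is worth checking in the Caddy logs before tearing the VM down.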

Phase H — Governance & docs

Task 18: ADR-025

Files: Create docs/decisions/025-local-vm-integration-testing.md

  • Step 1: Write the ADR — use docs/decisions/adr-template.md. Content (no placeholders — write these in full):

    • Status: Accepted (2026-06-18).
    • Context: Molecule (Level 1) can't catch reboot/firewall/Docker/boot-order bugs; the 2026-06-17 incident; ADR-008 Level 2/3 was deferred for lack of hosts but ubongo can host local KVM (verified /dev/kvm + VT-x).
    • Decision: libvirt/KVM (Approach A), one throwaway VM at a time from real inventory ("be askari"), stdlib driver over virsh, tiered certs (internal default, le-staging built, le-prod-wildcard on-demand), Ansible-managed substrate role, stubs via -e @ overlays.
    • Alternatives rejected: Proxmox-nested (heavy, ADR-015 tension, bugs aren't in provisioning); Vagrant (Ruby/plugin footprint, box drift); terraform-provider-libvirt (poor at imperative reboot loop, blurs ADR-006).
    • Consequences: new RAM load on ubongo (resource guard + one-at-a-time); reconciles ADR-015; accepted risk for le-prod-wildcard. Cross-reference ADR-008/015/006/024/016/020.
  • Step 2: Commit. git commit -am "docs(adr): ADR-025 local VM integration testing"

Task 19: pointers + entries

Files: Modify docs/decisions/008-testing.md, docs/decisions/015-control-host.md, docs/security/accepted-risks.md, CLAUDE.md, STATUS.md, docs/TODO.md, docs/hardware/reference.md

  • Step 1: ADR-008 — in the "what Molecule does NOT test" section, add a line: reboot-survivability / host-firewall×Docker / boot-order are now covered by local VM integration testing (ADR-025); add ADR-025 to the Level 2/3 description as its concrete build.

  • Step 2: ADR-015 — one line: ubongo runs ephemeral KVM test VMs as part of its local-test-runner role (ADR-025) — still not a production hypervisor; note the test-VM RAM load against the 16 GiB sizing.

  • Step 3: accepted-risks.md — add an entry: le-prod-wildcard integration runs — the production Gandi PAT (vault.gandi.pat) reaches an ephemeral local VM and transient _acme-challenge TXT records are written into the real wingu.me zone. Scope: on-demand only; staging is the default. Compensating: ephemeral VM, NAT-isolated, TXT auto-removed by Caddy. Owner/date.

  • Step 4: CLAUDE.md — add to the key-commands table:

| Integration-test a host on a local VM | `make test-integration HOST=<name> [CERTS=…]` |
| Clean up integration test VMs         | `make test-integration-clean`                |
  • Step 5: STATUS.md — add roles/integration_test/ + scripts/integration-vm.py to "Built + working"; note the RED→GREEN acceptance passed.

  • Step 6: TODO.md — collapse item 2.4 to a one-line pointer: "→ ADR-025 / make test-integration (built 2026-06-18)." (Do NOT renumber other items.)

  • Step 7: hardware/reference.md — add a note to ubongo's row/workloads: one integration VM (~3 GiB) at a time; don't run alongside a heavy Level-4 browser session.

  • Step 8: Commit. git commit -am "docs: wire ADR-025 into testing/control-host/risks/status/todo/capacity"

Task 20: runbook

Files: Create docs/runbooks/integration-testing.md

  • Step 1: Write it — sections: when to use it (firewall/sshd/boot/Docker changes, operationalises the standing "test risky infra before live deploy" rule + FRICTION #6 "validate reboot-recovery before retiring break-glass"); commands (cycle/up/apply/reboot/assert/down/prune/console, --certs, --keep); where diagnostics land (~/integration-runs/); how to inspect a kept failed VM (virsh console, ssh); the safety invariants; adding a new profile (a profiles/<host>.json + overrides/<host>.yml); the cert tiers and when to use each.

  • Step 2: Add a pre-flight line to docs/runbooks/new-host.md and the hardening runbook: before a lockout-risky change, make test-integration HOST=<name> and confirm reboot-recovery while the break-glass is still open.

  • Step 3: Commit. git commit -am "docs(runbook): integration-testing runbook + pre-flight cross-links"


Deferred (out of v1 scope — track in TODO/FRICTION, not this plan)

  • Task 21 (follow-on): coordinator fidelity — add netbird_coordinator to the askari profile's applies + the geo-DB stub var (needs reading roles/netbird_coordinator/), so signals #3 (mesh-bootstrap circularity) and #4 (egress FATAL-loop) reproduce. v1 gate is #1 only.
  • le-prod-wildcard issuance/persistence — issue *.test.wingu.me once, persist on ubongo, mount into the VM. Wired (cert file exists) but unused until needed.
  • Multi-VM mini-staging — inter-host mesh/dataplane.
  • Snapshot/reset — post-apply libvirt snapshot for fast re-runs without re-applying base roles.

Self-Review

Spec coverage: Approach A → Tasks 6-10. Substrate role → Task 1. Single-VM "be askari" → Tasks 12/15. Acceptance red→green → Tasks 15/16. Tiered certs (internal+le-staging built, le-prod-wildcard wired) → Tasks 11/12/17. Ansible-managed substrate → Task 1. Stubs in overlay (not inventory) → Task 12 (-e @). Safety invariants → Task 5 (single-host inv) + Task 12 (mesh_enabled: false) + Task 7 (isolated NAT). Resource guard / one-at-a-time → Task 7. Diagnostics → Task 9. Governance (ADR-025, ADR-008/015 pointers, accepted-risks, CLAUDE.md, runbook, STATUS, TODO, capacity) → Tasks 18-20. Gap closed: coordinator (#3/#4) explicitly deferred to Task 21 with the v1 gate stated as #1 — matches the spec's "minimum credible v1 is the red half" scoping.

Placeholder scan: none — _destroy's --nvram and the caddy data path in Task 17 Step 3 carry "adjust to actual" notes (verification actions, not placeholders). The base nftables handler name is a confirm-then-use step (Task 16 Step 3), not a guess.

Type/name consistency: vm_name/free_mib/parse_lease_ip/render_meta_data/render_user_data/cert_file/profile_path/render_run_hosts (pure, Tasks 3-5) ↔ used by up/apply/run_assert (Tasks 7-9). RUN_DIR/current written by up (Task 7), read by _read_current (Task 8). DISPATCH keys ↔ argparse subcommands (Task 2/10). Profile JSON keys (groups/applies/extra_vars_files/mem_mib/vcpus) ↔ apply (Task 8) + askari.json (Task 12). Cert files ↔ cert_file (Task 5) + Task 12. base__firewall_dropin_dir ↔ Task 16 template dest.