boma/docs/superpowers/plans/2026-06-18-local-vm-integration-testing.md
sjat 65533be4d9 docs(plan): implementation plan for local VM integration testing (2.4)
20-task TDD plan: integration_test substrate role, stdlib virsh driver, askari profile, tiered certs, RED->GREEN acceptance, docker_host container-forward fix, ADR-025 + docs. Follows the 2026-06-18 design spec.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 11:56:04 +02:00

1179 lines
49 KiB
Markdown

# Local VM Integration Testing Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Give the agent a `make test-integration HOST=<name>` loop that boots a throwaway KVM VM on ubongo mirroring a real host, applies the real playbooks, performs a **real reboot**, and asserts outcomes — catching the reboot/firewall/Docker class Molecule cannot (the 2026-06-17 incident).
**Architecture:** A non-service `integration_test` role installs the libvirt/QEMU substrate on ubongo. A stdlib-only driver `scripts/integration-vm.py` orchestrates the lifecycle over `virsh`/`virt-install`/`cloud-localds` (golden Debian-13 image → ephemeral qcow2 overlay → cloud-init seed → boot → apply real playbooks via a single-host transient inventory → reboot → verify playbook → teardown). Stubs and cert-tiers are passed as Ansible `-e @file` extra-vars so the real inventory is never edited and the driver never parses YAML.
**Tech Stack:** Debian 13 (trixie), libvirt 11.3 / `virt-install` 5.0.0 / QEMU-KVM, cloud-init NoCloud (`cloud-image-utils` 0.33), Ansible, Caddy v2 (DNS-01 via the existing `caddy-gandi` image), Python 3 stdlib, pytest, Molecule (Docker).
**Verified facts (ADR-014, 2026-06-18):**
- Image: `https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2` + `SHA512SUMS` alongside. Ships cloud-init; **no qemu-guest-agent** → get IP via `virsh domifaddr <dom> --source lease`.
- Seed: `cloud-localds seed.img user-data [meta-data]` (`cloud-image-utils`). Label `cidata`.
- `virt-install --import --disk path=...,format=qcow2 --disk path=seed.img,device=cdrom --network network=<net> --osinfo debian13 --graphics none --serial file,path=<log> --noautoconsole` (package `virt-install`; `virtinst` is a transitional shim).
- Isolated NAT net via `virsh net-define/net-start/net-autostart` (own bridge+subnet, `<forward mode='nat'/>`).
- Caddy: `acme_ca https://acme-staging-v02.api.letsencrypt.org/directory` (global), `tls internal` (self-signed), `tls { dns gandi {env.GANDI_BEARER_TOKEN} }` (DNS-01; module already compiled into the boma `caddy-gandi` image). LE staging rate limits are far more generous than production's, and staging certs are not publicly trusted; use staging for routine cert tests.
**Repo facts this plan extends:**
- `roles/base/templates/nftables.conf.j2:21`: `chain forward { ... policy drop; }`; line 26 `include "{{ base__firewall_dropin_dir }}/*.nft"`; `base__firewall_dropin_dir: /etc/nftables.d`. **The drop-in include already exists**, so `docker_host` just needs to ship a `.nft` file.
- `base__firewall_apply` gates application (`roles/base/tasks/firewall.yml:32-35`).
- `roles/docker_host/` installs Docker only; **no container-forward rules** (the green-half fix).
- `roles/reverse_proxy/templates/Caddyfile.j2`: global `acme_dns gandi {env.GANDI_BEARER_TOKEN}` when `reverse_proxy__acme_dns_provider == 'gandi'`; per-site blocks; Gandi PAT via `vault.gandi.pat` → `env.j2` → `GANDI_BEARER_TOKEN`. **No `acme_ca` or `tls internal` knob yet** (this plan adds them).
- askari: `inventories/production/offsite.yml` (`ansible_host: 77.42.120.136`, group `offsite_hosts`); `group_vars/offsite_hosts/vars.yml` (`base__firewall_apply: false`, `base__ssh_listen_mesh_only: false`); routes in `group_vars/all/reverse_proxy.yml`.
- `playbooks/site.yml` (base→all, docker_host→docker_hosts) + `playbooks/offsite.yml` (docker_host→reverse_proxy→netbird_coordinator on offsite_hosts).
- Makefile vars: `VENV PLAYBOOK_BIN INVENTORY VAULT_ARGS ROLE PLAYBOOK LIMIT TAGS`. pytest in `tests/test_*.py` (no conftest/pytest.ini; importlib-load of hyphenated scripts, see `tests/test_firewall_rules.py:1-13`). Tag vocabulary `tests/tags.yml`; `scripts/check-tags.py` run by `make lint`.
- None of `roles/integration_test/`, `scripts/integration-vm.py`, `tests/integration/` exist.
---
## File Structure
**Create:**
- `roles/integration_test/` — substrate role (defaults, tasks, handlers, meta, README, molecule/default/{molecule,converge,verify}.yml). Installs libvirt/QEMU/virt-install/cloud-image-utils; enables `libvirtd`; adds `sjat`/`claude` to `libvirt`+`kvm` groups; creates the image cache dir.
- `scripts/integration-vm.py` — stdlib-only driver. Pure helpers + impure orchestration + argparse CLI.
- `tests/test_integration_vm.py` — pytest for the driver's pure helpers.
- `tests/integration/profiles/askari.json` — driver-side profile metadata (groups, playbook+tags list, extra-vars files, mem/vcpu).
- `tests/integration/overrides/askari.yml` — Ansible stub extra-vars (firewall on, ssh break-glass).
- `tests/integration/certs/{internal,le-staging,le-prod-wildcard}.yml` — cert-tier extra-vars.
- `tests/integration/verify.yml` — outcome-based verify playbook.
- `tests/integration/README.md` — how the harness works.
- `docs/decisions/025-local-vm-integration-testing.md` — ADR.
- `docs/runbooks/integration-testing.md` — operator/agent runbook.
**Modify:**
- `roles/reverse_proxy/defaults/main.yml` + `templates/Caddyfile.j2` — add `reverse_proxy__tls_internal` + `reverse_proxy__acme_ca` knobs.
- `roles/docker_host/defaults/main.yml` + `tasks/main.yml` + new `templates/10-docker-forward.nft.j2` — the container-forward drop-in (green-half).
- `Makefile`: `test-integration`, `test-integration-clean` targets.
- `.gitignore`: `tests/integration/.run/` (`~/integration-runs/` is under $HOME, already outside the repo).
- `docs/decisions/008-testing.md`, `015-control-host.md`; `docs/security/accepted-risks.md`; `CLAUDE.md`; `STATUS.md`; `docs/TODO.md`; `docs/hardware/reference.md` — pointers/entries.
**Milestones:** RED (Task 15: harness reproduces the incident) → GREEN (Task 16: docker_host fix survives reboot) → le-staging cert tier (Task 17) → governance/docs (Tasks 18-20).
---
## Phase A — Substrate role
### Task 1: `integration_test` role (libvirt/QEMU substrate)
**Files:**
- Create: `roles/integration_test/{defaults,tasks,handlers,meta}/main.yml`, `roles/integration_test/README.md`, `roles/integration_test/molecule/default/{molecule,converge,verify}.yml`
- [ ] **Step 1: Scaffold**
Run: `make new-role NAME=integration_test`
Expected: `Role integration_test scaffolded at roles/integration_test/`
- [ ] **Step 2: defaults/main.yml**
```yaml
---
# integration_test — installs the local KVM/libvirt substrate on the control node
# (ubongo) so the agent can run throwaway VM integration tests (ADR-025). Non-service
# role; applied to the `control` group. Not a production hypervisor (ADR-015).
integration_test__packages:
  - qemu-system-x86        # KVM
  - qemu-utils             # qemu-img (overlays)
  - libvirt-daemon-system
  - libvirt-clients        # virsh
  - virt-install           # virt-install (trixie: the real pkg; `virtinst` is transitional)
  - cloud-image-utils      # cloud-localds (NoCloud seed)
  - genisoimage            # cloud-localds fallback
# Users granted libvirt/kvm access (run VMs without sudo).
integration_test__users:
  - sjat
  - claude
# Where the golden image + overlays live (outside the repo).
integration_test__cache_dir: "/var/lib/boma-integration"
```
- [ ] **Step 3: tasks/main.yml**
```yaml
---
- name: Install the KVM/libvirt substrate
  ansible.builtin.apt:
    name: "{{ integration_test__packages }}"
    state: present
    update_cache: true
  tags: [packages]

- name: Enable and start libvirtd
  ansible.builtin.systemd:
    name: libvirtd
    enabled: true
    state: started
  tags: [config]

- name: Grant users libvirt + kvm access
  ansible.builtin.user:
    name: "{{ item }}"
    groups: [libvirt, kvm]
    append: true
  loop: "{{ integration_test__users }}"
  tags: [users]

- name: Create the integration cache dir
  ansible.builtin.file:
    path: "{{ integration_test__cache_dir }}"
    state: directory
    owner: root
    group: libvirt
    mode: "2775"
  tags: [config]
```
- [ ] **Step 4: meta/main.yml** (mirror `roles/dev_env/meta/main.yml`: author `sjat`, Debian/trixie, `min_ansible_version: "2.17"`, `dependencies: []`, description naming ADR-025). **handlers/main.yml** stays `---` (no handlers). **README.md**: purpose, that it targets the `control` group, links ADR-025/ADR-015.
- [ ] **Step 5: molecule/default/molecule.yml** — copy `roles/dev_env/molecule/default/molecule.yml` verbatim (same Debian-13 systemd image).
- [ ] **Step 6: molecule/default/converge.yml**
```yaml
---
- name: Converge
  hosts: all
  become: true
  gather_facts: true
  roles:
    - role: integration_test
```
- [ ] **Step 7: molecule/default/verify.yml** (asserts the install outcomes only, NOT that libvirtd is active, because KVM cannot run inside the Molecule Docker container)
```yaml
---
- name: Verify
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Gather package facts
      ansible.builtin.package_facts:

    - name: Assert the substrate packages are installed
      ansible.builtin.assert:
        that:
          - "'libvirt-clients' in ansible_facts.packages"
          - "'virt-install' in ansible_facts.packages"
          - "'cloud-image-utils' in ansible_facts.packages"
          - "'qemu-system-x86' in ansible_facts.packages"

    - name: Cache dir exists
      ansible.builtin.stat:
        path: /var/lib/boma-integration
      register: _cache

    - name: Assert cache dir
      ansible.builtin.assert:
        that: [_cache.stat.isdir]
```
- [ ] **Step 8: Add the role to the control-node play.** Edit `playbooks/workstation.yml` (the control-node playbook that applies `dev_env`) to also import `integration_test` for `control`. Confirm the exact play first:
Run: `grep -n "dev_env\|hosts:\|control" playbooks/workstation.yml`
Then add under the same `control` play's roles:
```yaml
    - role: integration_test
      tags: [integration_test]
```
- [ ] **Step 9: Lint + Molecule**
Run: `make lint`
Expected: clean (new role-name tag `integration_test` auto-accepted by check-tags; concern tags `packages`/`config`/`users` are in `tests/tags.yml`).
Run: `make test ROLE=integration_test`
Expected: converge + idempotence + verify PASS.
- [ ] **Step 10: Commit**
```bash
git add roles/integration_test playbooks/workstation.yml
git commit -m "feat(integration_test): KVM/libvirt substrate role on the control node"
```
---
## Phase B — Driver: pure helpers (TDD)
### Task 2: Driver skeleton + constants + CLI dispatch
**Files:**
- Create: `scripts/integration-vm.py`
- Test: `tests/test_integration_vm.py`
- [ ] **Step 1: Write the failing test** (`tests/test_integration_vm.py`)
```python
import importlib.util
import pathlib
_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "integration-vm.py"
_spec = importlib.util.spec_from_file_location("integration_vm", _PATH)
ivm = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(ivm)


def test_valid_tiers():
    assert ivm.VALID_TIERS == ("internal", "le-staging", "le-prod-wildcard")
```
- [ ] **Step 2: Run it — fails (file missing)**
Run: `.venv/bin/pytest tests/test_integration_vm.py -q`
Expected: FAIL (cannot load `scripts/integration-vm.py`).
- [ ] **Step 3: Create the skeleton** (`scripts/integration-vm.py`)
```python
#!/usr/bin/env python3
"""boma local-VM integration test harness driver (ADR-025).
Stdlib-only by convention (TODO-14): never imports a YAML library. The transient
inventory is emitted via string templates; stubs/cert-tiers reach Ansible as
`-e @<file>` extra-vars; profile metadata is JSON. Talks to libvirt via `virsh`.
"""
import argparse
import hashlib
import json
import os
import pathlib
import re
import shutil
import subprocess
import sys
import time
import urllib.request
import uuid
REPO_ROOT = pathlib.Path(__file__).resolve().parent.parent
CACHE_DIR = pathlib.Path(os.environ.get("BOMA_IT_CACHE", "/var/lib/boma-integration"))
IMAGE_URL = "https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2"
SHA_URL = "https://cloud.debian.org/images/cloud/trixie/latest/SHA512SUMS"
IMAGE_NAME = "debian-13-genericcloud-amd64.qcow2"
NET_NAME = "boma-it"
NET_XML = """<network>
<name>boma-it</name>
<forward mode='nat'/>
<bridge name='virbr-boma' stp='on' delay='0'/>
<ip address='192.168.150.1' netmask='255.255.255.0'>
<dhcp><range start='192.168.150.10' end='192.168.150.254'/></dhcp>
</ip>
</network>
"""
NAME_PREFIX = "boma-it-"
RUN_DIR = REPO_ROOT / "tests" / "integration" / ".run"
DIAG_ROOT = pathlib.Path.home() / "integration-runs"
PROFILE_DIR = REPO_ROOT / "tests" / "integration" / "profiles"
INTEG_DIR = REPO_ROOT / "tests" / "integration"
CERT_DIR = REPO_ROOT / "tests" / "integration" / "certs"
DEFAULT_MEM_MIB = 3072
DEFAULT_VCPUS = 2
MIN_FREE_MIB = 4096
VALID_TIERS = ("internal", "le-staging", "le-prod-wildcard")


def main(argv=None):
    p = argparse.ArgumentParser(prog="integration-vm", description=__doc__)
    sub = p.add_subparsers(dest="cmd", required=True)
    # Every per-VM subcommand shares the same flags; `prune` takes none.
    for c in ("up", "apply", "reboot", "assert", "cycle", "down", "console"):
        sp = sub.add_parser(c)
        sp.add_argument("--host", required=True)
        sp.add_argument("--certs", choices=VALID_TIERS, default="internal")
        sp.add_argument("--keep", action="store_true")
        sp.add_argument("--no-reboot", action="store_true")
    sub.add_parser("prune")
    args = p.parse_args(argv)
    return DISPATCH[args.cmd](args)


if __name__ == "__main__":  # pragma: no cover
    sys.exit(main())
```
(Define `DISPATCH = {...}` after the command functions in later tasks; for now add a temporary `DISPATCH = {}` above `main` so import succeeds.)
- [ ] **Step 4: Run — passes**
Run: `.venv/bin/pytest tests/test_integration_vm.py -q`
Expected: PASS.
- [ ] **Step 5: Commit**
```bash
git add scripts/integration-vm.py tests/test_integration_vm.py
git commit -m "feat(integration-vm): driver skeleton + CLI dispatch"
```
### Task 3: `vm_name`, `free_mib`, `parse_lease_ip` (TDD)
**Files:** Modify `scripts/integration-vm.py`, `tests/test_integration_vm.py`
- [ ] **Step 1: Write failing tests**
```python
def test_vm_name_prefix_and_suffix():
    assert ivm.vm_name("askari", "ab12cd34") == "boma-it-askari-ab12cd34"


def test_vm_name_generates_suffix():
    n = ivm.vm_name("askari")
    assert n.startswith("boma-it-askari-") and len(n.split("-")[-1]) == 8


def test_free_mib_parses_memavailable():
    sample = "MemTotal: 16331156 kB\nMemAvailable: 8388608 kB\n"
    assert ivm.free_mib(sample) == 8192


def test_parse_lease_ip_extracts_ipv4():
    out = (" Name MAC address Protocol Address\n"
           "-------------------------------------------------------------------\n"
           " vnet0 52:54:00:aa:bb:cc ipv4 192.168.150.42/24\n")
    assert ivm.parse_lease_ip(out) == "192.168.150.42"


def test_parse_lease_ip_none_when_absent():
    assert ivm.parse_lease_ip("no leases\n") is None
```
- [ ] **Step 2: Run — fail.** `.venv/bin/pytest tests/test_integration_vm.py -q` → FAIL (no attrs).
- [ ] **Step 3: Implement** (add to `scripts/integration-vm.py`)
```python
def vm_name(host, suffix=None):
    # Random 8-hex suffix keeps concurrent or leftover domains distinguishable.
    suffix = suffix or uuid.uuid4().hex[:8]
    return f"{NAME_PREFIX}{host}-{suffix}"


def free_mib(meminfo_text):
    # /proc/meminfo reports kB; return 0 when MemAvailable is absent.
    m = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(m.group(1)) // 1024 if m else 0


def parse_lease_ip(domifaddr_output):
    m = re.search(r"ipv4\s+(\d+\.\d+\.\d+\.\d+)", domifaddr_output)
    return m.group(1) if m else None
```
- [ ] **Step 4: Run — pass.** `.venv/bin/pytest tests/test_integration_vm.py -q` → PASS.
- [ ] **Step 5: Commit.** `git commit -am "feat(integration-vm): vm naming, RAM guard, lease IP parsing"`
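
The regex in `parse_lease_ip` accepts any dotted quad, including out-of-range octets. If stricter parsing is wanted, one possible variant validates the match with the stdlib `ipaddress` module. This is a sketch only; the name `parse_lease_ip_strict` is hypothetical and not part of the plan.

```python
import ipaddress
import re


def parse_lease_ip_strict(domifaddr_output):
    """Return the first valid IPv4 address in `virsh domifaddr` output, or None."""
    pattern = r"ipv4\s+(\S+?)(?:/\d+)?\s*$"
    for match in re.finditer(pattern, domifaddr_output, re.MULTILINE):
        try:
            # IPv4Address rejects malformed or out-of-range values (e.g. 999.1.1.1).
            return str(ipaddress.IPv4Address(match.group(1)))
        except ipaddress.AddressValueError:
            continue
    return None
```

It keeps the same contract as `parse_lease_ip` (address string or `None`), so it could be swapped in without touching the existing tests.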
### Task 4: cloud-init `render_meta_data` / `render_user_data` (TDD)
**Files:** Modify driver + tests
- [ ] **Step 1: Write failing tests**
```python
def test_meta_data_has_instance_and_hostname():
    md = ivm.render_meta_data("iid-askari-x", "boma-it-askari-x")
    assert "instance-id: iid-askari-x" in md
    assert "local-hostname: boma-it-askari-x" in md


def test_user_data_injects_key_and_ansible_user():
    ud = ivm.render_user_data("ssh-ed25519 AAAA... claude@ubongo", "ansible")
    assert ud.startswith("#cloud-config")
    assert "name: ansible" in ud
    assert "ssh-ed25519 AAAA... claude@ubongo" in ud
    assert "NOPASSWD:ALL" in ud
```
- [ ] **Step 2: Run — fail.**
- [ ] **Step 3: Implement**
```python
def render_meta_data(instance_id, hostname):
    return f"instance-id: {instance_id}\nlocal-hostname: {hostname}\n"


def render_user_data(ssh_pubkey, ansible_user):
    # Hand-written YAML: the nesting below must stay valid cloud-config.
    return (
        "#cloud-config\n"
        "users:\n"
        f"  - name: {ansible_user}\n"
        "    sudo: 'ALL=(ALL) NOPASSWD:ALL'\n"
        "    shell: /bin/bash\n"
        "    ssh_authorized_keys:\n"
        f"      - {ssh_pubkey}\n"
        "ssh_pwauth: false\n"
        "package_update: false\n"
    )
```
- [ ] **Step 4: Run — pass.**
- [ ] **Step 5: Commit.** `git commit -am "feat(integration-vm): cloud-init user-data/meta-data rendering"`
### Task 5: `cert_file`, `profile_path`, `render_run_hosts` (TDD)
**Files:** Modify driver + tests
- [ ] **Step 1: Write failing tests**
```python
def test_cert_file_valid_tier():
    p = ivm.cert_file("le-staging")
    assert p.name == "le-staging.yml" and p.parent.name == "certs"


def test_cert_file_rejects_bad_tier():
    import pytest
    with pytest.raises(ValueError):
        ivm.cert_file("bogus")


def test_render_run_hosts_single_host_in_groups():
    out = ivm.render_run_hosts("boma-it-askari-x", "192.168.150.42",
                               "ansible", ["offsite_hosts"])
    assert "offsite_hosts:" in out
    assert "boma-it-askari-x:" in out
    assert "ansible_host: 192.168.150.42" in out
    assert "ansible_user: ansible" in out
    # invariant: the real askari host must NOT appear
    assert "askari:" not in out.replace("boma-it-askari-x:", "")
```
- [ ] **Step 2: Run — fail.**
- [ ] **Step 3: Implement**
```python
def cert_file(tier):
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown cert tier: {tier}")
    return CERT_DIR / f"{tier}.yml"


def profile_path(host):
    return PROFILE_DIR / f"{host}.json"


def render_run_hosts(name, ip, ansible_user, groups):
    # YAML inventory emitted by hand; indentation encodes all > children > group > hosts.
    lines = [
        "# Generated by scripts/integration-vm.py (transient, gitignored). Do not edit.",
        "# Single test host ONLY (safety invariant: no real host is ever in scope).",
        "all:",
        "  children:",
    ]
    for g in groups:
        lines += [
            f"    {g}:",
            "      hosts:",
            f"        {name}:",
            f"          ansible_host: {ip}",
            f"          ansible_user: {ansible_user}",
        ]
    return "\n".join(lines) + "\n"
```
- [ ] **Step 4: Run — pass.**
- [ ] **Step 5: Commit.** `git commit -am "feat(integration-vm): cert-tier + profile + transient inventory rendering"`
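
Because the transient inventory is built from string templates rather than a YAML emitter, a `--host` value containing `:` or a newline could change the structure of the generated file. One way to guard against that is to validate the name before it reaches `vm_name` or `render_run_hosts`. The helper below is a sketch; `validate_host`, its character set, and the 32-character cap are assumptions, not part of the plan.

```python
import re

# Assumption: profile/host names are short lowercase DNS-style labels.
_HOST_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?")


def validate_host(host):
    """Return host unchanged if it is safe to interpolate into YAML and paths."""
    if not _HOST_RE.fullmatch(host):
        raise ValueError(f"unsafe host/profile name: {host!r}")
    return host
```

The same check would also keep `profile_path(host)` from resolving outside `tests/integration/profiles/`, since `/` and `.` are rejected.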
---
## Phase C — Driver: orchestration (impure)
### Task 6: `sh` helper + `ensure_image`
**Files:** Modify driver
- [ ] **Step 1: Implement the subprocess helper + image fetch**
```python
def sh(cmd, check=True, capture=False, **kw):
    """Run a command (list form). Logs the command to stderr."""
    print("+ " + " ".join(str(c) for c in cmd), file=sys.stderr)
    return subprocess.run(cmd, check=check,
                          capture_output=capture, text=True, **kw)


def _expected_sha(sha_text, filename):
    # SHA512SUMS lines look like "<hex>  <name>" or "<hex> *<name>".
    for line in sha_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0]
    return None


def ensure_image():
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    img = CACHE_DIR / IMAGE_NAME
    if img.exists():
        return img
    print(f"Downloading {IMAGE_URL} ...", file=sys.stderr)
    # Download to a .part file so a partial or unverified image is never used.
    tmp = img.with_suffix(".part")
    urllib.request.urlretrieve(IMAGE_URL, tmp)
    sha_text = urllib.request.urlopen(SHA_URL).read().decode()
    want = _expected_sha(sha_text, IMAGE_NAME)
    if not want:
        tmp.unlink(missing_ok=True)
        raise SystemExit(f"checksum for {IMAGE_NAME} not found at {SHA_URL}")
    h = hashlib.sha512()
    with open(tmp, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != want:
        tmp.unlink(missing_ok=True)
        raise SystemExit("golden image SHA512 mismatch; refusing to use it")
    tmp.rename(img)
    return img
```
- [ ] **Step 2: Manual verification**
The CLI cannot dispatch anything until Task 10 wires `DISPATCH`, so for now load the module and call `ensure_image` directly:
```bash
.venv/bin/python -c "import importlib.util,pathlib; \
s=importlib.util.spec_from_file_location('ivm','scripts/integration-vm.py'); \
m=importlib.util.module_from_spec(s); s.loader.exec_module(m); print(m.ensure_image())"
```
Expected: downloads to `/var/lib/boma-integration/debian-13-genericcloud-amd64.qcow2`, SHA512 verified, prints the path. (Requires Task 1's role applied so the cache dir is group-writable, or run with sudo once.)
- [ ] **Step 3: Commit.** `git commit -am "feat(integration-vm): golden image fetch + SHA512 verification"`
### Task 7: `net_ensure`, `up` (boot a VM)
**Files:** Modify driver
- [ ] **Step 1: Implement**
```python
def net_ensure():
    r = sh(["virsh", "net-info", NET_NAME], check=False, capture=True)
    if r.returncode != 0:
        xml = RUN_DIR / "net.xml"
        RUN_DIR.mkdir(parents=True, exist_ok=True)
        xml.write_text(NET_XML)
        sh(["virsh", "net-define", str(xml)])
        sh(["virsh", "net-autostart", NET_NAME])
    active = sh(["virsh", "net-info", NET_NAME], capture=True).stdout
    # `virsh net-info` pads the value column, so match any run of whitespace.
    if not re.search(r"^Active:\s+yes", active, re.MULTILINE):
        sh(["virsh", "net-start", NET_NAME])


def _ssh_pubkey():
    for cand in ("id_ed25519.pub", "id_rsa.pub"):
        p = pathlib.Path.home() / ".ssh" / cand
        if p.exists():
            return p.read_text().strip()
    raise SystemExit("no SSH public key found in ~/.ssh")


def up(host, name=None, mem_mib=DEFAULT_MEM_MIB, vcpus=DEFAULT_VCPUS):
    # Guard 1: refuse to start when the control node is short on RAM.
    free = free_mib(pathlib.Path("/proc/meminfo").read_text())
    if free < MIN_FREE_MIB:
        raise SystemExit(f"refusing to start: only {free} MiB free (< {MIN_FREE_MIB})")
    # Guard 2: one integration VM at a time.
    running = sh(["virsh", "list", "--name"], capture=True).stdout.split()
    if any(n.startswith(NAME_PREFIX) for n in running):
        raise SystemExit("an integration VM is already running (one at a time); "
                         "run `integration-vm prune` first")
    name = name or vm_name(host)
    img = ensure_image()
    net_ensure()
    RUN_DIR.mkdir(parents=True, exist_ok=True)
    # Ephemeral overlay backed by the golden image; the golden image is never written.
    overlay = RUN_DIR / f"{name}.qcow2"
    sh(["qemu-img", "create", "-f", "qcow2", "-F", "qcow2", "-b", str(img), str(overlay)])
    (RUN_DIR / "user-data").write_text(render_user_data(_ssh_pubkey(), "ansible"))
    (RUN_DIR / "meta-data").write_text(render_meta_data(f"iid-{name}", name))
    seed = RUN_DIR / f"{name}-seed.img"
    sh(["cloud-localds", str(seed), str(RUN_DIR / "user-data"), str(RUN_DIR / "meta-data")])
    DIAG_ROOT.mkdir(parents=True, exist_ok=True)
    console = DIAG_ROOT / f"{name}-console.log"
    sh(["virt-install", "--name", name, "--memory", str(mem_mib), "--vcpus", str(vcpus),
        "--import",
        "--disk", f"path={overlay},format=qcow2",
        "--disk", f"path={seed},device=cdrom",
        "--network", f"network={NET_NAME}",
        "--osinfo", "debian13",
        "--graphics", "none",
        "--serial", f"file,path={console}",
        "--noautoconsole"])
    ip = wait_for_ip(name)
    wait_for_ssh(ip, "ansible")
    # State file read by later subcommands: name, ip, host (one per line).
    (RUN_DIR / "current").write_text(f"{name}\n{ip}\n{host}\n")
    print(f"VM {name} up at {ip}")
    return name, ip


def wait_for_ip(name, timeout=120):
    end = time.time() + timeout
    while time.time() < end:
        out = sh(["virsh", "domifaddr", name, "--source", "lease"],
                 check=False, capture=True).stdout
        ip = parse_lease_ip(out)
        if ip:
            return ip
        time.sleep(4)
    raise SystemExit(f"timed out waiting for {name} to get a DHCP lease")


def wait_for_ssh(ip, user, timeout=180):
    end = time.time() + timeout
    while time.time() < end:
        r = sh(["ssh", "-o", "StrictHostKeyChecking=no",
                "-o", "UserKnownHostsFile=/dev/null", "-o", "ConnectTimeout=5",
                f"{user}@{ip}", "true"], check=False, capture=True)
        if r.returncode == 0:
            return
        time.sleep(5)
    raise SystemExit(f"timed out waiting for SSH to {ip}")
```
- [ ] **Step 2: Manual smoke (real KVM — requires Task 1 applied to ubongo)**
```bash
.venv/bin/python scripts/integration-vm.py up --host askari # via DISPATCH once Task 10 lands
```
Expected: golden image present, `boma-it` net active, overlay + seed created, VM boots, prints `VM boma-it-askari-<id> up at 192.168.150.x`. SSH in: `ssh ansible@<ip>` works.
- [ ] **Step 3: Commit.** `git commit -am "feat(integration-vm): network + VM boot (overlay, cloud-init seed, virt-install import)"`
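
`wait_for_ip` and `wait_for_ssh` share the same poll-until-deadline loop. If that loop should be unit-tested without real sleeps, it could be factored into a helper with an injectable clock. The following is a sketch under that assumption; `wait_until` and `FakeClock` are hypothetical names that do not appear in the plan.

```python
import time


def wait_until(probe, timeout, interval, clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None on timeout.
    """
    deadline = clock() + timeout
    while True:
        result = probe()
        if result:
            return result
        if clock() >= deadline:
            return None
        sleep(interval)


class FakeClock:
    """Test double: sleeping advances time instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# Usage: the third probe succeeds, after two simulated 4-second sleeps.
fake = FakeClock()
answers = iter([None, None, "192.168.150.42"])
found = wait_until(lambda: next(answers), 120, 4, clock=fake.time, sleep=fake.sleep)
```

With this shape, `wait_for_ip` would reduce to a probe that runs `virsh domifaddr` and a `SystemExit` when `wait_until` returns `None`.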
### Task 8: `write_run_inventory`, `apply`
**Files:** Modify driver
- [ ] **Step 1: Implement**
```python
def _read_current():
    txt = (RUN_DIR / "current").read_text().splitlines()
    return txt[0], txt[1], txt[2]  # name, ip, host


def write_run_inventory(name, ip, groups):
    RUN_DIR.mkdir(parents=True, exist_ok=True)
    (RUN_DIR / "hosts.yml").write_text(
        render_run_hosts(name, ip, "ansible", groups))
    # Reuse the production group_vars via a symlink; refresh a stale link,
    # but leave a real directory alone.
    link = RUN_DIR / "group_vars"
    target = REPO_ROOT / "inventories" / "production" / "group_vars"
    if link.is_symlink() or link.exists():
        if link.is_symlink():
            link.unlink()
    if not link.exists():
        link.symlink_to(target)


def apply(host, certs):
    name, ip, _ = _read_current()
    prof = json.loads(profile_path(host).read_text())
    write_run_inventory(name, ip, prof["groups"])
    # Stubs first, cert tier last, all as `-e @file` extra-vars.
    extra = []
    for f in prof.get("extra_vars_files", []):
        extra += ["-e", f"@{INTEG_DIR / f}"]
    extra += ["-e", f"@{cert_file(certs)}"]
    for step in prof["applies"]:
        cmd = [".venv/bin/ansible-playbook", "-i", str(RUN_DIR) + "/",
               f"playbooks/{step['playbook']}", "--limit", name]
        if step.get("tags"):
            cmd += ["--tags", ",".join(step["tags"])]
        cmd += extra
        sh(cmd, cwd=str(REPO_ROOT))
    print(f"applied {host} profile to {name}")
```
- [ ] **Step 2: Manual verification** — deferred to the Task 15 RED run (needs the profile/overlay/cert files from Phase D). Lint passes regardless.
- [ ] **Step 3: Commit.** `git commit -am "feat(integration-vm): transient inventory + real-playbook apply"`
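
Since manual verification of `apply` is deferred, the command it assembles could still be pinned down by a unit test if the argv construction were a pure function. This is a sketch of that idea; `build_apply_cmd` is a hypothetical helper, and the paths in the usage lines are illustrative only.

```python
def build_apply_cmd(inventory_dir, playbook, limit, tags=(), extra_vars_files=()):
    """Return the ansible-playbook argv for one `applies` step (no side effects)."""
    cmd = [".venv/bin/ansible-playbook", "-i", f"{inventory_dir}/",
           f"playbooks/{playbook}", "--limit", limit]
    if tags:
        cmd += ["--tags", ",".join(tags)]
    # Each extra-vars file is passed as `-e @path`, in the order given.
    for path in extra_vars_files:
        cmd += ["-e", f"@{path}"]
    return cmd


# Usage with illustrative values:
cmd = build_apply_cmd(
    "tests/integration/.run", "site.yml", "boma-it-askari-ab12cd34",
    tags=["base"],
    extra_vars_files=["tests/integration/overrides/askari.yml",
                      "tests/integration/certs/internal.yml"],
)
```

A test can then assert that `--limit` always names the throwaway VM, which is the safety property the transient inventory is meant to protect.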
### Task 9: `reboot_vm`, `run_assert`, `dump_diagnostics`
**Files:** Modify driver
- [ ] **Step 1: Implement**
```python
def reboot_vm():
    name, ip, _ = _read_current()
    sh(["virsh", "reboot", name])
    time.sleep(5)
    wait_for_ssh(ip, "ansible")
    print(f"{name} rebooted, SSH back at {ip}")


def run_assert(host, certs):
    name, ip, _ = _read_current()
    prof = json.loads(profile_path(host).read_text())
    write_run_inventory(name, ip, prof["groups"])
    extra = []
    for f in prof.get("extra_vars_files", []):
        extra += ["-e", f"@{INTEG_DIR / f}"]
    extra += ["-e", f"@{cert_file(certs)}"]
    cmd = [".venv/bin/ansible-playbook", "-i", str(RUN_DIR) + "/",
           "tests/integration/verify.yml", "--limit", name] + extra
    r = sh(cmd, cwd=str(REPO_ROOT), check=False)
    if r.returncode != 0:
        dump_diagnostics(name, ip)
        raise SystemExit(f"VERIFY FAILED for {name}; diagnostics in {DIAG_ROOT}")
    print(f"VERIFY PASSED for {name}")


def dump_diagnostics(name, ip):
    d = DIAG_ROOT / name
    d.mkdir(parents=True, exist_ok=True)
    # Best-effort capture: each command's stdout+stderr goes to <label>.txt.
    for label, cmd in [
        ("nft", "nft list ruleset"),
        ("docker", "docker ps -a"),
        ("ss", "ss -tlnp"),
        ("journal", "journalctl -b --no-pager"),
        ("critical-chain", "systemd-analyze critical-chain"),
    ]:
        r = sh(["ssh", "-o", "StrictHostKeyChecking=no",
                "-o", "UserKnownHostsFile=/dev/null",
                f"ansible@{ip}", "sudo " + cmd], check=False, capture=True)
        (d / f"{label}.txt").write_text((r.stdout or "") + (r.stderr or ""))
    console = DIAG_ROOT / f"{name}-console.log"
    if console.exists():
        shutil.copy(console, d / "console.log")
    print(f"diagnostics written to {d}", file=sys.stderr)
```
- [ ] **Step 2: Commit.** `git commit -am "feat(integration-vm): reboot, verify run, failure diagnostics"`
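
`reboot_vm` sleeps five seconds and then waits for SSH. If the guest is slow to shut down, SSH may still answer from the old boot, and the verify playbook would then run against a host that has not actually rebooted. One possible safeguard is to compare the kernel boot id (`/proc/sys/kernel/random/boot_id` on Linux) before and after. The sketch below shows only the comparison logic; `confirm_reboot` is a hypothetical helper, and `read_boot_id` stands in for an SSH call that returns the id or `None` while sshd is down.

```python
def confirm_reboot(read_boot_id, before, attempts=36, pause=lambda: None):
    """Return the new boot id once it differs from `before`, else None.

    read_boot_id: callable returning the current boot id string, or None
                  when the guest is unreachable (assumed to wrap an SSH call).
    pause:        callable invoked between attempts (e.g. a 5-second sleep).
    """
    for _ in range(attempts):
        current = read_boot_id()
        if current and current != before:
            return current
        pause()
    return None


# Usage with canned answers: old id, two unreachable polls, then a new id.
ids = iter(["aaa", None, None, "bbb"])
new_id = confirm_reboot(lambda: next(ids), before="aaa")
```

In the driver this would mean reading the boot id before `virsh reboot` and failing the run if `confirm_reboot` returns `None`.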
### Task 10: `down`, `prune`, `console`, `cycle` + `DISPATCH`
**Files:** Modify driver
- [ ] **Step 1: Implement**
```python
def _destroy(name):
    # Both calls are best-effort: the domain may already be stopped or undefined.
    sh(["virsh", "destroy", name], check=False)
    sh(["virsh", "undefine", name, "--nvram"], check=False)
    for f in RUN_DIR.glob(f"{name}*"):
        f.unlink(missing_ok=True)


def down(host=None, keep=False):
    if keep:
        print("--keep: leaving the VM running for inspection")
        return
    cur = RUN_DIR / "current"
    if cur.exists():
        name = cur.read_text().splitlines()[0]
        _destroy(name)
        cur.unlink(missing_ok=True)
        print(f"destroyed {name}")


def prune():
    # --all includes shut-off domains, so leftovers from crashed runs are removed too.
    running = sh(["virsh", "list", "--all", "--name"], capture=True).stdout.split()
    for n in running:
        if n.startswith(NAME_PREFIX):
            _destroy(n)
            print(f"pruned {n}")
    (RUN_DIR / "current").unlink(missing_ok=True)


def console():
    name = (RUN_DIR / "current").read_text().splitlines()[0]
    log = DIAG_ROOT / f"{name}-console.log"
    print(log.read_text() if log.exists() else f"no console log at {log}")


def cycle(host, certs, keep=False, no_reboot=False):
    try:
        up(host)
        apply(host, certs)
        if not no_reboot:
            reboot_vm()
        run_assert(host, certs)
    finally:
        # On success destroy; on failure (SystemExit) keep for inspection unless --keep flips it.
        if not keep:
            down(host)
```
Wire the dispatch (replace the temporary `DISPATCH = {}`):
```python
DISPATCH = {
    "up": lambda a: (up(a.host), None)[1],
    "apply": lambda a: apply(a.host, a.certs),
    "reboot": lambda a: reboot_vm(),
    "assert": lambda a: run_assert(a.host, a.certs),
    "down": lambda a: down(a.host, a.keep),
    "console": lambda a: console(),
    "prune": lambda a: prune(),
    "cycle": lambda a: cycle(a.host, a.certs, a.keep, a.no_reboot),
}
```
Fix `cycle`'s teardown semantics: on **failure** keep the VM (so it can be inspected); on **success** destroy. Implement by catching success explicitly:
```python
def cycle(host, certs, keep=False, no_reboot=False):
    ok = False
    try:
        up(host)
        apply(host, certs)
        if not no_reboot:
            reboot_vm()
        run_assert(host, certs)
        ok = True
    finally:
        if ok and not keep:
            down(host)
        elif not ok:
            print("FAILED: VM left up for inspection; `integration-vm prune` to clean.",
                  file=sys.stderr)
```
- [ ] **Step 2: Run unit tests + lint.** `.venv/bin/pytest tests/test_integration_vm.py -q` PASS; `make lint` clean.
- [ ] **Step 3: Commit.** `git commit -am "feat(integration-vm): teardown, prune, console, full cycle + dispatch"`
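
The teardown rule for `cycle` (destroy only after a successful run without `--keep`) is small enough to state as a truth table. The sketch below encodes it as a pure function so it could be unit-tested; `teardown_action` and its return labels are hypothetical and are not used by the driver as written.

```python
def teardown_action(ok, keep):
    """Decide what `cycle` does with the VM once the run finishes.

    ok:   True if up/apply/reboot/verify all succeeded.
    keep: True if the caller passed --keep.
    """
    if not ok:
        return "keep-for-inspection"  # failure always leaves the VM up
    return "keep" if keep else "destroy"


# Full truth table: (ok, keep) -> action
TABLE = {(ok, keep): teardown_action(ok, keep)
         for ok in (True, False) for keep in (True, False)}
```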
---
## Phase D — Profile, cert `internal` tier, verify playbook
### Task 11: reverse_proxy `tls internal` + `acme_ca` knobs
**Files:** Modify `roles/reverse_proxy/defaults/main.yml`, `roles/reverse_proxy/templates/Caddyfile.j2`
- [ ] **Step 1: defaults** — append:
```yaml
# Integration-test / staging cert knobs (ADR-025). Default off = production behaviour.
reverse_proxy__tls_internal: false # true => every site uses Caddy's self-signed CA
reverse_proxy__acme_ca: "" # set to the LE staging directory URL to use staging
```
- [ ] **Step 2: Caddyfile.j2** — in the global options block (after the `email` line), add:
```jinja
{% if reverse_proxy__acme_ca %}
acme_ca {{ reverse_proxy__acme_ca }}
{% endif %}
```
In each site block (inside `{{ r['host'] }} {`), add as the first directive:
```jinja
{% if reverse_proxy__tls_internal %}
tls internal
{% endif %}
```
- [ ] **Step 3: Molecule regression** — confirm `reverse_proxy` still renders. If the role has a Molecule scenario, run `make test ROLE=reverse_proxy`; else `make lint`.
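Expected: clean; default-off means production output is byte-identical. The `{% if %}` blocks emit nothing as long as the `{% if %}`/`{% endif %}` tags start at column 0: Ansible's `template` module enables `trim_blocks` by default (so the newline after each tag is dropped) but not `lstrip_blocks` (so any indentation before a tag would be emitted).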
- [ ] **Step 4: Commit.** `git commit -am "feat(reverse_proxy): tls-internal + acme_ca knobs for integration/staging (ADR-025)"`
### Task 12: askari profile + overlay + cert-tier files
**Files:** Create `tests/integration/profiles/askari.json`, `tests/integration/overrides/askari.yml`, `tests/integration/certs/{internal,le-staging,le-prod-wildcard}.yml`
- [ ] **Step 1: `profiles/askari.json`**
```json
{
  "groups": ["offsite_hosts"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]},
    {"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
  ],
  "extra_vars_files": ["overrides/askari.yml"],
  "mem_mib": 3072,
  "vcpus": 2
}
```
(`netbird_coordinator` is intentionally omitted from v1 `applies` — Caddy's published :443 gives the DNAT that reproduces FRICTION #1. Coordinator fidelity (#3/#4) is a follow-on, Task 21.)
- [ ] **Step 2: `overrides/askari.yml`** (Ansible extra-vars; highest precedence — never edits real inventory)
```yaml
---
# Integration-test overlay for the "askari" profile (ADR-025). Passed via `-e @`.
# Reproduces the 2026-06-17 incident: apply base's nftables default-deny to a Docker host.
base__firewall_apply: true
# Keep a break-glass: sshd stays on all interfaces (never wt0-only in a throwaway VM).
base__ssh_listen_mesh_only: false
# The VM is isolated; it must never touch the real mesh.
base__mesh_enabled: false
```
- [ ] **Step 3: cert-tier files**
`certs/internal.yml`:
```yaml
---
reverse_proxy__tls_internal: true
```
`certs/le-staging.yml`:
```yaml
---
reverse_proxy__tls_internal: false
reverse_proxy__acme_dns_provider: gandi
reverse_proxy__acme_ca: "https://acme-staging-v02.api.letsencrypt.org/directory"
```
`certs/le-prod-wildcard.yml`:
```yaml
---
# On-demand only. Records an accepted risk (ADR-025 / accepted-risks.md): the prod
# Gandi PAT reaches an ephemeral VM and transient TXT records land in the real wingu.me.
reverse_proxy__tls_internal: false
reverse_proxy__acme_dns_provider: gandi
reverse_proxy__acme_ca: ""
```
- [ ] **Step 4: Commit.** `git commit -am "feat(integration): askari profile, stub overlay, cert-tier files"`
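
To make the relationship between the profile, the overlay, and the cert-tier files concrete, here is a minimal sketch of how a driver could turn them into `ansible-playbook` invocations. It is not the driver's actual `apply` code: `build_apply_commands`, the `playbooks/` prefix, and the inventory path are assumptions made for illustration. The overlay files are passed first and the cert tier last, on the understanding that later `-e` values take precedence in Ansible.

```python
from pathlib import PurePosixPath

# Assumed layout (hypothetical; the real driver may differ): profile-relative
# files live under tests/integration/, playbooks under playbooks/.
IT_DIR = PurePosixPath("tests/integration")


def build_apply_commands(profile, inventory, certs="internal"):
    """Return one ansible-playbook argv per entry in the profile's `applies`."""
    # Overlay files first, cert tier last, so the cert tier wins on a shared key.
    extra_files = [IT_DIR / name for name in profile["extra_vars_files"]]
    extra_files.append(IT_DIR / "certs" / f"{certs}.yml")

    commands = []
    for step in profile["applies"]:
        argv = ["ansible-playbook", "-i", inventory, f"playbooks/{step['playbook']}"]
        if step.get("tags"):
            argv += ["--tags", ",".join(step["tags"])]
        for path in extra_files:
            argv += ["-e", f"@{path}"]  # `-e @file` loads the file as extra-vars
        commands.append(argv)
    return commands


# The askari profile from Step 1, reduced to the keys this sketch reads.
askari = {
    "applies": [
        {"playbook": "site.yml", "tags": ["base"]},
        {"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]},
    ],
    "extra_vars_files": ["overrides/askari.yml"],
}
for argv in build_apply_commands(askari, "tests/integration/.run/hosts.yml"):
    print(" ".join(argv))
```

Because every override travels as a file reference, the sketch never opens or parses YAML, which matches the plan's constraint that the driver stays stdlib-only.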
### Task 13: verify playbook
**Files:** Create `tests/integration/verify.yml`
- [ ] **Step 1: Write it**
```yaml
---
# Integration verify (ADR-025). Outcome-based: proves Docker forwarding survives the
# reboot. The load-bearing check probes the VM's published :443 FROM the controller
# (ubongo); if base's forward-drop killed DNAT, this times out (the FRICTION #1 bug).
- name: Verify the rebooted host
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Docker daemon is active
      ansible.builtin.command: systemctl is-active docker
      changed_when: false

    - name: Forward chain permits container traffic (drop-in loaded)
      ansible.builtin.command: nft list chain inet filter forward
      register: _fwd
      changed_when: false

    - name: Assert container forwarding is allowed (not pure drop)
      ansible.builtin.assert:
        that: "'accept' in _fwd.stdout"
        fail_msg: >-
          forward chain is pure drop: container forwarding will die on reboot
          (FRICTION 2026-06-17 #1). docker_host container-forward drop-in missing.

    - name: Published HTTPS port answers from the controller (DNAT + forward alive)
      delegate_to: localhost
      become: false
      ansible.builtin.uri:
        # With delegate_to, bare `ansible_host` can resolve to the delegate
        # (localhost), so read the VM's address from hostvars explicitly.
        url: "https://{{ hostvars[inventory_hostname].ansible_host }}/"
        validate_certs: false
        status_code: [200, 308, 404, 502, 503]
        timeout: 10
      register: _probe
      retries: 5
      delay: 6
      until: _probe is succeeded
```
- [ ] **Step 2: Lint.** `make lint` — clean (file is under `tests/`, not `playbooks/`, but keep tags valid; this play uses none, which is fine).
- [ ] **Step 3: Commit.** `git commit -am "feat(integration): outcome-based verify playbook (DNAT-survives-reboot)"`
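
For debugging a kept VM by hand, the published-port probe can be approximated outside Ansible. The following is a stdlib sketch of the same idea (unverified TLS, a fixed set of acceptable statuses, bounded retries); `https_status` and `probe_until_alive` are hypothetical helper names, and the attempt count is an illustrative default rather than a claim about how Ansible counts `retries`. A call such as `probe_until_alive(lambda: https_status("192.0.2.10"))` would probe a VM at that (example) address; the fetch function is injectable so the retry logic can be tested without a network.

```python
import http.client
import ssl
import time

# Same set as `status_code` in verify.yml: any of these means the port answered.
ALIVE = {200, 308, 404, 502, 503}


def https_status(host, timeout=10):
    """GET https://host/ without certificate validation; return the raw status."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before verify_mode is relaxed
    ctx.verify_mode = ssl.CERT_NONE
    conn = http.client.HTTPSConnection(host, 443, timeout=timeout, context=ctx)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status  # http.client does not follow redirects
    finally:
        conn.close()


def probe_until_alive(fetch, attempts=5, delay=6.0, sleep=time.sleep):
    """Call fetch() up to `attempts` times; True once it returns a status in ALIVE."""
    for attempt in range(1, attempts + 1):
        try:
            if fetch() in ALIVE:
                return True
        except (OSError, http.client.HTTPException):
            # Timeouts and refused connections are what a dead DNAT path looks like.
            pass
        if attempt < attempts:
            sleep(delay)
    return False
```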
---
## Phase E — Makefile + RED milestone
### Task 14: Makefile targets + .gitignore
**Files:** Modify `Makefile`, `.gitignore`
- [ ] **Step 1: Makefile** — add after the `test-all` target:
```makefile
test-integration:
ifndef HOST
	$(error HOST is required: make test-integration HOST=<name> [CERTS=internal|le-staging] [KEEP=1])
endif
	PATH="$(CURDIR)/$(VENV)/bin:$$PATH" $(PYTHON) scripts/integration-vm.py cycle \
		--host $(HOST) $(if $(CERTS),--certs $(CERTS)) $(if $(KEEP),--keep)

test-integration-clean:
	PATH="$(CURDIR)/$(VENV)/bin:$$PATH" $(PYTHON) scripts/integration-vm.py prune
```
Add both to `.PHONY` and the `help` block (match the existing style).
- [ ] **Step 2: .gitignore** — add:
```
# Integration-test transient run dir (ADR-025); diagnostics live under ~/integration-runs
tests/integration/.run/
```
- [ ] **Step 3: Commit.** `git commit -am "feat(make): test-integration / test-integration-clean targets"`
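
The Makefile targets only forward flags, so the contract worth pinning down is the command line they produce. Below is a minimal `argparse` sketch of the receiving side for the two subcommands these targets call. It is an illustration, not the driver's real parser: `build_parser`, the `choices` list, and the `internal` default are assumptions based on the cert-tier files and the ADR summary in this plan.

```python
import argparse


def build_parser():
    """Parser for the two subcommands the Makefile targets invoke."""
    parser = argparse.ArgumentParser(prog="integration-vm.py")
    sub = parser.add_subparsers(dest="command", required=True)

    cycle = sub.add_parser("cycle", help="boot, apply, reboot, verify, tear down")
    cycle.add_argument("--host", required=True, help="profile name, e.g. askari")
    cycle.add_argument(
        "--certs",
        choices=["internal", "le-staging", "le-prod-wildcard"],
        default="internal",  # assumed default: the tier that needs no secrets
    )
    cycle.add_argument("--keep", action="store_true", help="skip teardown")

    sub.add_parser("prune", help="remove leftover integration VMs")
    return parser


args = build_parser().parse_args(["cycle", "--host", "askari", "--certs", "le-staging"])
print(args.command, args.host, args.certs, args.keep)
```

One design consequence of `$(if $(KEEP),--keep)`: make treats any non-empty value as true, so `KEEP=0` still passes `--keep`. Leave `KEEP` unset to get teardown.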
### Task 15: RED milestone — reproduce the incident
**Files:** none (a validation run); record the outcome.
- [ ] **Step 1: Pre-flight** — confirm `rbw unlocked` (the apply decrypts `group_vars/all/vault.yml`); confirm Task 1's role is applied to ubongo (`virsh version` works, you're in the `libvirt` group — may need a re-login).
- [ ] **Step 2: Run the cycle on TODAY's base (no docker_host fix yet)**
Run: `make test-integration HOST=askari`
Expected: VM boots → base (firewall on) + docker_host + reverse_proxy apply → **reboot** → verify **FAILS** at "Assert container forwarding is allowed" and/or the :443 probe times out. Diagnostics appear under `~/integration-runs/boma-it-askari-<id>/` (nft shows `forward { policy drop }` with no accepts; the published port is dead).
- [ ] **Step 3: Confirm the failure is the RIGHT one** — read `~/integration-runs/<name>/nft.txt`: the `inet filter forward` chain is pure `policy drop`. This is the faithful reproduction of FRICTION #1. **If verify PASSES here, the harness is not faithful — stop and investigate** (e.g. Docker re-added its own accepts, or the firewall didn't apply).
- [ ] **Step 4: Clean up.** `make test-integration-clean`
- [ ] **Step 5: Record** — append a `[gotcha]`/milestone note to `docs/FRICTION.md` Open signals: "ADR-025 harness reproduced the 2026-06-17 firewall×Docker×reboot bug on a local VM (RED). Diagnostics: nft forward pure-drop, :443 DNAT dead post-reboot." Commit:
```bash
git commit -am "test(integration): RED — harness reproduces the 2026-06-17 incident"
```
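
Step 3 asks for a judgment call on `nft.txt`: is the forward chain a pure `policy drop`, or did something add accepts? A small classifier makes that check repeatable. This is a sketch under the assumption that `nft.txt` holds the output of `nft list chain inet filter forward`; `forward_chain_state` is a hypothetical helper, not part of the driver.

```python
import re


def forward_chain_state(nft_text):
    """Classify `nft list chain inet filter forward` output.

    Returns "pure-drop", "has-accepts", or "not-drop".
    """
    start = nft_text.find("chain forward")
    if start == -1:
        raise ValueError("no forward chain in input")
    open_at = nft_text.index("{", start)

    # Walk to the matching close brace so anonymous sets like { 80, 443 } are handled.
    depth = 0
    for pos in range(open_at, len(nft_text)):
        if nft_text[pos] == "{":
            depth += 1
        elif nft_text[pos] == "}":
            depth -= 1
            if depth == 0:
                body = nft_text[open_at + 1:pos]
                break
    else:
        raise ValueError("unbalanced braces in forward chain")

    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    policy_drop = any(re.search(r"\bpolicy\s+drop\b", ln) for ln in lines)
    # Everything except the `type ... hook ...; policy ...;` header is a rule.
    rules = [ln for ln in lines if not ln.startswith("type ")]
    if not policy_drop:
        return "not-drop"
    if any(re.search(r"\baccept\b", ln) for ln in rules):
        return "has-accepts"
    return "pure-drop"


red = """table inet filter {
    chain forward {
        type filter hook forward priority filter; policy drop;
    }
}"""
print(forward_chain_state(red))  # the RED milestone expects "pure-drop"
```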
---
## Phase F — GREEN milestone (docker_host fix)
### Task 16: docker_host container-forward drop-in
**Files:** Modify `roles/docker_host/defaults/main.yml`, `roles/docker_host/tasks/main.yml`; Create `roles/docker_host/templates/10-docker-forward.nft.j2`
- [ ] **Step 1: defaults** — append:
```yaml
# Container-forward nftables drop-in (FRICTION 2026-06-17 #1 / ADR-025). base's inet
# filter forward chain is `policy drop`; a drop verdict there is final, so Docker's own
# ip-filter accepts can't save forwarded container traffic. We append accepts to base's
# forward chain via base's /etc/nftables.d/*.nft include. Only meaningful on hosts where
# base__firewall_apply is true.
docker_host__forward_dropin: true
```
- [ ] **Step 2: template `templates/10-docker-forward.nft.j2`**
```jinja
# {{ ansible_managed }}
# Allow container forwarding through base's default-deny forward chain (ADR-025).
# nftables uses a trailing `*` as the interface-name wildcard (`+` is iptables syntax).
table inet filter {
  chain forward {
    ct state established,related accept
    iifname "docker0" accept
    oifname "docker0" accept
    iifname "br-*" accept
    oifname "br-*" accept
  }
}
```
- [ ] **Step 3: tasks/main.yml** — append (after Docker install):
```yaml
- name: Install the container-forward nftables drop-in
  ansible.builtin.template:
    src: 10-docker-forward.nft.j2
    dest: "{{ base__firewall_dropin_dir }}/10-docker-forward.nft"
    mode: "0644"
  when: docker_host__forward_dropin | bool
  notify: reload nftables
  tags: [firewall]
```
Confirm the handler name base exposes:
Run: `grep -rn "listen:\|reload nftables\|nftables" roles/base/handlers/main.yml`
Use base's actual handler `listen:` topic; if none fits, add a `docker_host` handler that runs `nft -f /etc/nftables.conf` (the same reload base uses). Show the handler you add in `roles/docker_host/handlers/main.yml`:
```yaml
---
- name: reload nftables
  ansible.builtin.command: nft -f /etc/nftables.conf
  listen: reload nftables
```
- [ ] **Step 4: GREEN run**
Run: `make test-integration HOST=askari`
Expected: apply (now includes the drop-in) → reboot → verify **PASSES** (forward chain has `accept` rules; :443 answers from ubongo). This is the red→green proof.
If it still fails, read diagnostics and iterate the `.nft` rules (e.g. Docker's compose bridges, or a NAT/masquerade gap) — **this is exactly what the harness is for**. Keep iterating Step 2 until verify passes.
- [ ] **Step 5: Idempotence + lint + Molecule.** `make lint`; `make test ROLE=docker_host` (add a Molecule assertion that the drop-in file renders if the role has a scenario).
- [ ] **Step 6: Commit.** `git commit -am "fix(docker_host): container-forward nftables drop-in survives reboot (FRICTION #1, ADR-025)"`
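
While iterating on the `.nft` rules in Step 4, it helps to know quickly which expected accepts are absent from a rendered drop-in. The sketch below compares a rendered file against a list of rules built from interface names. Both helpers are hypothetical, and the default covers only `docker0` and the conntrack rule; pass further interface names or patterns through `bridges` to match whatever the template ends up containing.

```python
def expected_forward_rules(bridges=("docker0",)):
    """Rules the drop-in should contain for the given interface names."""
    rules = ["ct state established,related accept"]
    for name in bridges:
        rules.append(f'iifname "{name}" accept')
        rules.append(f'oifname "{name}" accept')
    return rules


def missing_forward_rules(rendered, bridges=("docker0",)):
    """Return the expected rules that do not appear in the rendered drop-in."""
    # Collapse runs of whitespace so indentation differences do not matter.
    present = {" ".join(line.split()) for line in rendered.splitlines()}
    return [rule for rule in expected_forward_rules(bridges) if rule not in present]


rendered = """table inet filter {
  chain forward {
    ct state established,related accept
    iifname "docker0" accept
  }
}"""
print(missing_forward_rules(rendered))  # lists the oifname rule that is absent
```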
---
## Phase G — le-staging cert tier
### Task 17: validate `--certs le-staging`
**Files:** none new (exercises Task 11/12); may tweak `overrides/askari.yml` if DNS-01 names need adjusting.
- [ ] **Step 1: Pre-flight.** Confirm `rbw unlocked` (the run needs `vault.gandi.pat` for DNS-01). The VM needs outbound egress (the `boma-it` NAT net provides it).
- [ ] **Step 2: Run with the staging cert tier**
Run: `make test-integration HOST=askari CERTS=le-staging`
Expected: same apply, but Caddy now uses DNS-01 against LE **staging** (untrusted root) for the profile's route hostnames (under `wingu.me`, whose DNS lives at Gandi). Verify still passes (the :443 probe uses `validate_certs: false`).
- [ ] **Step 3: Confirm a real staging cert issued.** Run `make test-integration HOST=askari CERTS=le-staging KEEP=1`, then:
```bash
NAME=$(.venv/bin/python -c "print(open('tests/integration/.run/current').read().split()[0])")
IP=$(sed -n 2p tests/integration/.run/current)
ssh ansible@$IP "sudo docker exec caddy ls /data/caddy/certificates" # adjust to the caddy data path
```
Expected: a cert dir under an `acme-staging-v02...` issuer path (proves the DNS-01 staging path works end to end). Then `make test-integration-clean`.
- [ ] **Step 4: Commit** (only if `overrides`/`certs` needed tweaks): `git commit -am "test(integration): validate le-staging DNS-01 cert path"`
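
The manual check in Step 3 has two parts: reading the VM name and IP from `.run/current`, and spotting a staging issuer in the certificate listing. A sketch of both follows, assuming (as the shell snippet implies) that the file holds the name on line 1 and the IP on line 2. The helper names and the sample values are illustrative only, and the issuer prefix is taken from the expected output described above.

```python
def parse_current(text):
    """Split the run-state file into (vm_name, ip): name on line 1, IP on line 2."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected a name line and an IP line")
    return lines[0].split()[0], lines[1].strip()


def staging_issuers(listing):
    """Return entries of an `ls` listing that look like LE staging issuer dirs."""
    return [entry for entry in listing.split() if entry.startswith("acme-staging-v02")]


# Illustrative inputs only: a made-up run id, an example-range IP, a sample listing.
name, ip = parse_current("boma-it-askari-0001\n192.0.2.10\n")
print(name, ip)
print(staging_issuers("acme-staging-v02.api.letsencrypt.org-directory\nlocal\n"))
```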
---
## Phase H — Governance & docs
### Task 18: ADR-025
**Files:** Create `docs/decisions/025-local-vm-integration-testing.md`
- [ ] **Step 1: Write the ADR** — use `docs/decisions/adr-template.md`. Content (no placeholders — write these in full):
- **Status:** Accepted (2026-06-18).
- **Context:** Molecule (Level 1) can't catch reboot/firewall/Docker/boot-order bugs; the 2026-06-17 incident; ADR-008 Level 2/3 was deferred for lack of hosts but ubongo can host local KVM (verified `/dev/kvm` + VT-x).
- **Decision:** libvirt/KVM (Approach A), one throwaway VM at a time from real inventory ("be askari"), stdlib driver over `virsh`, tiered certs (`internal` default, `le-staging` built, `le-prod-wildcard` on-demand), Ansible-managed substrate role, stubs via `-e @` overlays.
- **Alternatives rejected:** Proxmox-nested (heavy, ADR-015 tension, bugs aren't in provisioning); Vagrant (Ruby/plugin footprint, box drift); terraform-provider-libvirt (poor at imperative reboot loop, blurs ADR-006).
- **Consequences:** new RAM load on ubongo (resource guard + one-at-a-time); reconciles ADR-015; accepted risk for `le-prod-wildcard`. Cross-reference ADR-008/015/006/024/016/020.
- [ ] **Step 2: Commit.** `git commit -am "docs(adr): ADR-025 local VM integration testing"`
### Task 19: pointers + entries
**Files:** Modify `docs/decisions/008-testing.md`, `docs/decisions/015-control-host.md`, `docs/security/accepted-risks.md`, `CLAUDE.md`, `STATUS.md`, `docs/TODO.md`, `docs/hardware/reference.md`
- [ ] **Step 1: ADR-008** — in the "what Molecule does NOT test" section, add a line: reboot-survivability / host-firewall×Docker / boot-order are now covered by **local VM integration testing (ADR-025)**; add ADR-025 to the Level 2/3 description as its concrete build.
- [ ] **Step 2: ADR-015** — one line: ubongo runs **ephemeral KVM test VMs** as part of its local-test-runner role (ADR-025) — still not a production hypervisor; note the test-VM RAM load against the 16 GiB sizing.
- [ ] **Step 3: accepted-risks.md** — add an entry: *le-prod-wildcard integration runs* — the production Gandi PAT (`vault.gandi.pat`) reaches an ephemeral local VM and transient `_acme-challenge` TXT records are written into the real `wingu.me` zone. Scope: on-demand only; staging is the default. Compensating: ephemeral VM, NAT-isolated, TXT auto-removed by Caddy. Owner/date.
- [ ] **Step 4: CLAUDE.md** — add to the key-commands table:
```
| Integration-test a host on a local VM | `make test-integration HOST=<name> [CERTS=…]` |
| Clean up integration test VMs | `make test-integration-clean` |
```
- [ ] **Step 5: STATUS.md** — add `roles/integration_test/` + `scripts/integration-vm.py` to "Built + working"; note the RED→GREEN acceptance passed.
- [ ] **Step 6: TODO.md** — collapse item 2.4 to a one-line pointer: "→ ADR-025 / `make test-integration` (built 2026-06-18)." (Do NOT renumber other items.)
- [ ] **Step 7: hardware/reference.md** — add a note to ubongo's row/workloads: one integration VM (~3 GiB) at a time; don't run alongside a heavy Level-4 browser session.
- [ ] **Step 8: Commit.** `git commit -am "docs: wire ADR-025 into testing/control-host/risks/status/todo/capacity"`
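
The capacity notes in Steps 2 and 7 rest on a simple check: is there enough available memory for one ~3 GiB VM right now? A minimal sketch of that arithmetic, assuming the guard reads `MemAvailable` from `/proc/meminfo`; `available_mib`, `can_boot`, and the 1 GiB headroom are hypothetical choices, not values taken from the driver.

```python
def available_mib(meminfo):
    """Extract MemAvailable from /proc/meminfo text and convert kB to MiB."""
    for line in meminfo.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found")


def can_boot(meminfo, mem_mib, headroom_mib=1024):
    """True if the VM's memory plus headroom fits in what is available now."""
    return available_mib(meminfo) >= mem_mib + headroom_mib


# Sample text in /proc/meminfo format; on ubongo, read the real file instead.
sample = "MemTotal:       16303428 kB\nMemFree:         1200000 kB\nMemAvailable:    8388608 kB\n"
print(available_mib(sample), can_boot(sample, mem_mib=3072))
```

With the sample above, 8388608 kB is 8192 MiB, which clears the 3072 MiB profile plus 1024 MiB of headroom.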
### Task 20: runbook
**Files:** Create `docs/runbooks/integration-testing.md`
- [ ] **Step 1: Write it** with these sections:
  - When to use it: firewall/sshd/boot/Docker changes. It operationalises the standing "test risky infra before live deploy" rule and FRICTION #6 ("validate reboot-recovery before retiring break-glass").
  - Commands: `cycle`/`up`/`apply`/`reboot`/`assert`/`down`/`prune`/`console`, plus `--certs` and `--keep`.
  - Where diagnostics land (`~/integration-runs/`).
  - How to inspect a kept failed VM (`virsh console`, ssh).
  - The safety invariants.
  - Adding a new profile (a `profiles/<host>.json` + `overrides/<host>.yml`).
  - The cert tiers and when to use each.
- [ ] **Step 2: Add a pre-flight line** to `docs/runbooks/new-host.md` and the hardening runbook: before a lockout-risky change, `make test-integration HOST=<name>` and confirm reboot-recovery while the break-glass is still open.
- [ ] **Step 3: Commit.** `git commit -am "docs(runbook): integration-testing runbook + pre-flight cross-links"`
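
The runbook's "adding a new profile" section could point at a quick structural check, so a typo in a new `profiles/<host>.json` fails before a VM boots. The following is a sketch only: `profile_errors` is a hypothetical helper, and the required keys and types are inferred from the askari profile in Task 12 rather than from a schema the driver enforces.

```python
import json

# Keys and types inferred from profiles/askari.json (an assumption, not a schema).
REQUIRED = {
    "groups": list,
    "applies": list,
    "extra_vars_files": list,
    "mem_mib": int,
    "vcpus": int,
}


def profile_errors(profile):
    """Return a list of human-readable problems; an empty list means it looks fine."""
    errors = []
    for key, kind in REQUIRED.items():
        if key not in profile:
            errors.append(f"missing key: {key}")
        elif not isinstance(profile[key], kind):
            errors.append(f"{key} must be {kind.__name__}")
    applies = profile.get("applies")
    if isinstance(applies, list):
        for index, step in enumerate(applies):
            if not isinstance(step, dict) or "playbook" not in step:
                errors.append(f"applies[{index}] needs a playbook")
    return errors


draft = json.loads('{"groups": ["offsite_hosts"], "applies": [{"tags": ["base"]}], '
                   '"extra_vars_files": [], "mem_mib": 3072}')
print(profile_errors(draft))
```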
---
## Deferred (out of v1 scope — track in TODO/FRICTION, not this plan)
- **Task 21 (follow-on): coordinator fidelity** — add `netbird_coordinator` to the askari profile's `applies` + the geo-DB stub var (needs reading `roles/netbird_coordinator/`), so signals #3 (mesh-bootstrap circularity) and #4 (egress FATAL-loop) reproduce. v1 gate is #1 only.
- **`le-prod-wildcard` issuance/persistence** — issue `*.test.wingu.me` once, persist on ubongo, mount into the VM. Wired (cert file exists) but unused until needed.
- **Multi-VM mini-staging** — inter-host mesh/dataplane.
- **Snapshot/`reset`** — post-apply libvirt snapshot for fast re-runs without re-applying base roles.
---
## Self-Review
**Spec coverage:** Approach A → Tasks 6-10. Substrate role → Task 1. Single-VM "be askari" → Tasks 12/15. Acceptance red→green → Tasks 15/16. Tiered certs (`internal`+`le-staging` built, `le-prod-wildcard` wired) → Tasks 11/12/17. Ansible-managed substrate → Task 1. Stubs in overlay (not inventory) → Task 12 (`-e @`). Safety invariants → Task 5 (single-host inv) + Task 12 (`mesh_enabled: false`) + Task 7 (isolated NAT). Resource guard / one-at-a-time → Task 7. Diagnostics → Task 9. Governance (ADR-025, ADR-008/015 pointers, accepted-risks, CLAUDE.md, runbook, STATUS, TODO, capacity) → Tasks 18-20. **Gap closed:** coordinator (#3/#4) explicitly deferred to Task 21 with the v1 gate stated as #1 — matches the spec's "minimum credible v1 is the red half" scoping.
**Placeholder scan:** none — `_destroy`'s `--nvram` and the caddy data path in Task 17 Step 3 carry "adjust to actual" notes (verification actions, not placeholders). The base nftables handler name is a confirm-then-use step (Task 16 Step 3), not a guess.
**Type/name consistency:** `vm_name/free_mib/parse_lease_ip/render_meta_data/render_user_data/cert_file/profile_path/render_run_hosts` (pure, Tasks 3-5) ↔ used by `up/apply/run_assert` (Tasks 7-9). `RUN_DIR/current` written by `up` (Task 7), read by `_read_current` (Task 8). `DISPATCH` keys ↔ argparse subcommands (Task 2/10). Profile JSON keys (`groups`/`applies`/`extra_vars_files`/`mem_mib`/`vcpus`) ↔ `apply` (Task 8) + `askari.json` (Task 12). Cert files ↔ `cert_file` (Task 5) + Task 12. `base__firewall_dropin_dir` ↔ Task 16 template dest.