# Local VM Integration Testing Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Give the agent a `make test-integration HOST=<name>` loop that boots a throwaway KVM VM on ubongo mirroring a real host, applies the real playbooks, performs a **real reboot**, and asserts outcomes, catching the reboot/firewall/Docker class of failure that Molecule cannot (the 2026-06-17 incident).

**Architecture:** A non-service `integration_test` role installs the libvirt/QEMU substrate on ubongo. A stdlib-only driver `scripts/integration-vm.py` orchestrates the lifecycle over `virsh`/`virt-install`/`cloud-localds` (golden Debian-13 image → ephemeral qcow2 overlay → cloud-init seed → boot → apply real playbooks via a single-host transient inventory → reboot → verify playbook → teardown). Stubs and cert-tiers are passed as Ansible `-e @file` extra-vars so the real inventory is never edited and the driver never parses YAML.

**Tech Stack:** Debian 13 (trixie), libvirt 11.3 / `virt-install` 5.0.0 / QEMU-KVM, cloud-init NoCloud (`cloud-image-utils` 0.33), Ansible, Caddy v2 (DNS-01 via the existing `caddy-gandi` image), Python 3 stdlib, pytest, Molecule (Docker).

**Verified facts (ADR-014, 2026-06-18):**
- Image: `https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2` + `SHA512SUMS` alongside. Ships cloud-init; **no qemu-guest-agent** → get IP via `virsh domifaddr <dom> --source lease`.
- Seed: `cloud-localds seed.img user-data [meta-data]` (`cloud-image-utils`). Label `cidata`.
- `virt-install --import --disk path=...,format=qcow2 --disk path=seed.img,device=cdrom --network network=<net> --osinfo debian13 --graphics none --serial file,path=<log> --noautoconsole` (package `virt-install`; `virtinst` is a transitional shim).
- Isolated NAT net via `virsh net-define/net-start/net-autostart` (own bridge+subnet, `<forward mode='nat'/>`).
- Caddy: `acme_ca https://acme-staging-v02.api.letsencrypt.org/directory` (global), `tls internal` (self-signed), `tls { dns gandi {env.GANDI_BEARER_TOKEN} }` (DNS-01; module already compiled into the boma `caddy-gandi` image). LE staging limits are effectively unlimited; use staging for routine cert tests.

**Repo facts this plan extends:**
- `roles/base/templates/nftables.conf.j2:21`: `chain forward { ... policy drop; }`; line 26 `include "{{ base__firewall_dropin_dir }}/*.nft"`; `base__firewall_dropin_dir: /etc/nftables.d`. **The drop-in include already exists**, so `docker_host` just needs to ship a `.nft` file.
- `base__firewall_apply` gates application (`roles/base/tasks/firewall.yml:32-35`).
- `roles/docker_host/` installs Docker only; **no container-forward rules** (the green-half fix).
- `roles/reverse_proxy/templates/Caddyfile.j2`: global `acme_dns gandi {env.GANDI_BEARER_TOKEN}` when `reverse_proxy__acme_dns_provider == 'gandi'`; per-site blocks; Gandi PAT via `vault.gandi.pat` → `env.j2` `GANDI_BEARER_TOKEN`. **No `acme_ca` or `tls internal` knob yet** (this plan adds them).
- askari: `inventories/production/offsite.yml` (`ansible_host: 77.42.120.136`, group `offsite_hosts`); `group_vars/offsite_hosts/vars.yml` (`base__firewall_apply: false`, `base__ssh_listen_mesh_only: false`); routes in `group_vars/all/reverse_proxy.yml`.
- `playbooks/site.yml` (base→all, docker_host→docker_hosts) + `playbooks/offsite.yml` (docker_host→reverse_proxy→netbird_coordinator on offsite_hosts).
- Makefile vars: `VENV PLAYBOOK_BIN INVENTORY VAULT_ARGS ROLE PLAYBOOK LIMIT TAGS`. pytest in `tests/test_*.py` (no conftest/pytest.ini; importlib-load of hyphenated scripts, see `tests/test_firewall_rules.py:1-13`). Tag vocabulary `tests/tags.yml`; `scripts/check-tags.py` run by `make lint`.
- None of `roles/integration_test/`, `scripts/integration-vm.py`, `tests/integration/` exist.

---

## File Structure

**Create:**
- `roles/integration_test/`: substrate role (defaults, tasks, handlers, meta, README, molecule/default/{molecule,converge,verify}.yml). Installs libvirt/QEMU/virt-install/cloud-image-utils; enables `libvirtd`; adds `sjat`/`claude` to `libvirt`+`kvm` groups; creates the image cache dir.
- `scripts/integration-vm.py`: stdlib-only driver. Pure helpers + impure orchestration + argparse CLI.
- `tests/test_integration_vm.py`: pytest for the driver's pure helpers.
- `tests/integration/profiles/askari.json`: driver-side profile metadata (groups, playbook+tags list, extra-vars files, mem/vcpu).
- `tests/integration/overrides/askari.yml`: Ansible stub extra-vars (firewall on, ssh break-glass).
- `tests/integration/certs/{internal,le-staging,le-prod-wildcard}.yml`: cert-tier extra-vars.
- `tests/integration/verify.yml`: outcome-based verify playbook.
- `tests/integration/README.md`: how the harness works.
- `docs/decisions/025-local-vm-integration-testing.md`: ADR.
- `docs/runbooks/integration-testing.md`: operator/agent runbook.

**Modify:**
- `roles/reverse_proxy/defaults/main.yml` + `templates/Caddyfile.j2`: add `reverse_proxy__tls_internal` + `reverse_proxy__acme_ca` knobs.
- `roles/docker_host/defaults/main.yml` + `tasks/main.yml` + new `templates/10-docker-forward.nft.j2`: the container-forward drop-in (green-half).
- `Makefile`: `test-integration`, `test-integration-clean` targets.
- `.gitignore`: add `tests/integration/.run/`. (`integration-runs/` lives under `$HOME`, already outside the repo, so it needs no entry.)
- `docs/decisions/008-testing.md`, `015-control-host.md`; `docs/security/accepted-risks.md`; `CLAUDE.md`; `STATUS.md`; `docs/TODO.md`; `docs/hardware/reference.md`: pointers/entries.

**Milestones:** RED (Task 15: harness reproduces the incident) → GREEN (Task 16: docker_host fix survives reboot) → le-staging cert tier (Task 17) → governance/docs (Tasks 18-20).

---

## Phase A: Substrate role

### Task 1: `integration_test` role (libvirt/QEMU substrate)

**Files:**
- Create: `roles/integration_test/{defaults,tasks,handlers,meta}/main.yml`, `roles/integration_test/README.md`, `roles/integration_test/molecule/default/{molecule,converge,verify}.yml`

- [ ] **Step 1: Scaffold**

Run: `make new-role NAME=integration_test`
Expected: `Role integration_test scaffolded at roles/integration_test/`

- [ ] **Step 2: defaults/main.yml**

```yaml
---
# integration_test: installs the local KVM/libvirt substrate on the control node
# (ubongo) so the agent can run throwaway VM integration tests (ADR-025). Non-service
# role; applied to the `control` group. Not a production hypervisor (ADR-015).
integration_test__packages:
  - qemu-system-x86  # KVM
  - qemu-utils  # qemu-img (overlays)
  - libvirt-daemon-system
  - libvirt-clients  # virsh
  - virt-install  # virt-install (trixie: the real pkg; `virtinst` is transitional)
  - cloud-image-utils  # cloud-localds (NoCloud seed)
  - genisoimage  # cloud-localds fallback
# Users granted libvirt/kvm access (run VMs without sudo).
integration_test__users:
  - sjat
  - claude
# Where the golden image + overlays live (outside the repo).
integration_test__cache_dir: "/var/lib/boma-integration"
```

- [ ] **Step 3: tasks/main.yml**

```yaml
---
- name: Install the KVM/libvirt substrate
  ansible.builtin.apt:
    name: "{{ integration_test__packages }}"
    state: present
    update_cache: true
  tags: [packages]

- name: Enable and start libvirtd
  ansible.builtin.systemd:
    name: libvirtd
    enabled: true
    state: started
  tags: [config]

- name: Grant users libvirt + kvm access
  ansible.builtin.user:
    name: "{{ item }}"
    groups: [libvirt, kvm]
    append: true
  loop: "{{ integration_test__users }}"
  tags: [users]

- name: Create the integration cache dir
  ansible.builtin.file:
    path: "{{ integration_test__cache_dir }}"
    state: directory
    owner: root
    group: libvirt
    mode: "2775"
  tags: [config]
```

- [ ] **Step 4: meta/main.yml** (mirror `roles/dev_env/meta/main.yml`: author `sjat`, Debian/trixie, `min_ansible_version: "2.17"`, `dependencies: []`, description naming ADR-025). **handlers/main.yml** stays `---` (no handlers). **README.md**: purpose, that it targets the `control` group, links ADR-025/ADR-015.

- [ ] **Step 5: molecule/default/molecule.yml**: copy `roles/dev_env/molecule/default/molecule.yml` verbatim (same Debian-13 systemd image).

- [ ] **Step 6: molecule/default/converge.yml**

```yaml
---
- name: Converge
  hosts: all
  become: true
  gather_facts: true
  roles:
    - role: integration_test
```

- [ ] **Step 7: molecule/default/verify.yml** (assert the install results only, NOT an active libvirtd, since KVM cannot run inside the Docker container Molecule uses)

```yaml
---
- name: Verify
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Gather package facts
      ansible.builtin.package_facts:
    - name: Assert the substrate packages are installed
      ansible.builtin.assert:
        that:
          - "'libvirt-clients' in ansible_facts.packages"
          - "'virt-install' in ansible_facts.packages"
          - "'cloud-image-utils' in ansible_facts.packages"
          - "'qemu-system-x86' in ansible_facts.packages"
    - name: Cache dir exists
      ansible.builtin.stat:
        path: /var/lib/boma-integration
      register: _cache
    - name: Assert cache dir
      ansible.builtin.assert:
        that: [_cache.stat.isdir]
```

- [ ] **Step 8: Add the role to the control-node play.** Edit `playbooks/workstation.yml` (the control-node playbook that applies `dev_env`) to also import `integration_test` for `control`. Confirm the exact play first:

Run: `grep -n "dev_env\|hosts:\|control" playbooks/workstation.yml`
Then add under the same `control` play's roles:
```yaml
    - role: integration_test
      tags: [integration_test]
```

- [ ] **Step 9: Lint + Molecule**

Run: `make lint`
Expected: clean (new role-name tag `integration_test` auto-accepted by check-tags; concern tags `packages`/`config`/`users` are in `tests/tags.yml`).
Run: `make test ROLE=integration_test`
Expected: converge + idempotence + verify PASS.

- [ ] **Step 10: Commit**

```bash
git add roles/integration_test playbooks/workstation.yml
git commit -m "feat(integration_test): KVM/libvirt substrate role on the control node"
```

---

## Phase B: Driver pure helpers (TDD)

### Task 2: Driver skeleton + constants + CLI dispatch

**Files:**
- Create: `scripts/integration-vm.py`
- Test: `tests/test_integration_vm.py`

- [ ] **Step 1: Write the failing test** (`tests/test_integration_vm.py`)

```python
import importlib.util
import pathlib

_PATH = pathlib.Path(__file__).resolve().parent.parent / "scripts" / "integration-vm.py"
_spec = importlib.util.spec_from_file_location("integration_vm", _PATH)
ivm = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(ivm)


def test_valid_tiers():
    assert ivm.VALID_TIERS == ("internal", "le-staging", "le-prod-wildcard")
```

- [ ] **Step 2: Run it (fails: file missing)**

Run: `.venv/bin/pytest tests/test_integration_vm.py -q`
Expected: FAIL (cannot load `scripts/integration-vm.py`).

- [ ] **Step 3: Create the skeleton** (`scripts/integration-vm.py`)

```python
#!/usr/bin/env python3
"""boma local-VM integration test harness driver (ADR-025).

Stdlib-only by convention (TODO-14): never imports a YAML library. The transient
inventory is emitted via string templates; stubs/cert-tiers reach Ansible as
`-e @<file>` extra-vars; profile metadata is JSON. Talks to libvirt via `virsh`.
"""
import argparse
import hashlib
import json
import os
import pathlib
import re
import shutil
import subprocess
import sys
import time
import urllib.request
import uuid

REPO_ROOT = pathlib.Path(__file__).resolve().parent.parent
CACHE_DIR = pathlib.Path(os.environ.get("BOMA_IT_CACHE", "/var/lib/boma-integration"))
IMAGE_URL = "https://cloud.debian.org/images/cloud/trixie/latest/debian-13-genericcloud-amd64.qcow2"
SHA_URL = "https://cloud.debian.org/images/cloud/trixie/latest/SHA512SUMS"
IMAGE_NAME = "debian-13-genericcloud-amd64.qcow2"
NET_NAME = "boma-it"
NET_XML = """<network>
  <name>boma-it</name>
  <forward mode='nat'/>
  <bridge name='virbr-boma' stp='on' delay='0'/>
  <ip address='192.168.150.1' netmask='255.255.255.0'>
    <dhcp><range start='192.168.150.10' end='192.168.150.254'/></dhcp>
  </ip>
</network>
"""
NAME_PREFIX = "boma-it-"
RUN_DIR = REPO_ROOT / "tests" / "integration" / ".run"
DIAG_ROOT = pathlib.Path.home() / "integration-runs"
PROFILE_DIR = REPO_ROOT / "tests" / "integration" / "profiles"
INTEG_DIR = REPO_ROOT / "tests" / "integration"
CERT_DIR = REPO_ROOT / "tests" / "integration" / "certs"
DEFAULT_MEM_MIB = 3072
DEFAULT_VCPUS = 2
MIN_FREE_MIB = 4096
VALID_TIERS = ("internal", "le-staging", "le-prod-wildcard")


def main(argv=None):
    p = argparse.ArgumentParser(prog="integration-vm", description=__doc__)
    sub = p.add_subparsers(dest="cmd", required=True)
    # Every per-host subcommand shares the same option set.
    for c in ("up", "apply", "reboot", "assert", "cycle", "down", "console"):
        sp = sub.add_parser(c)
        sp.add_argument("--host", required=True)
        sp.add_argument("--certs", choices=VALID_TIERS, default="internal")
        sp.add_argument("--keep", action="store_true")
        sp.add_argument("--no-reboot", action="store_true")
    sub.add_parser("prune")
    args = p.parse_args(argv)
    return DISPATCH[args.cmd](args)


if __name__ == "__main__":  # pragma: no cover
    sys.exit(main())
```

(Define `DISPATCH = {...}` after the command functions in later tasks; for now add a temporary `DISPATCH = {}` above `main` so the name is defined until then.)

- [ ] **Step 4: Run (passes)**

Run: `.venv/bin/pytest tests/test_integration_vm.py -q`
Expected: PASS.

- [ ] **Step 5: Commit**

```bash
git add scripts/integration-vm.py tests/test_integration_vm.py
git commit -m "feat(integration-vm): driver skeleton + CLI dispatch"
```

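The temporary `DISPATCH = {}` from Step 3 is replaced in later tasks by a table mapping subcommand names to handler functions. The following is a minimal sketch of that pattern only, not the driver's final code: `cmd_up` and `cmd_prune` are hypothetical stand-ins, and they return strings purely so the routing is easy to observe (the real handlers are expected to return an exit status for `sys.exit`).

```python
import argparse


def cmd_up(args):
    # Hypothetical stand-in for the real `up` handler added in a later task.
    return f"up {args.host}"


def cmd_prune(args):
    # Hypothetical stand-in; `prune` takes no per-host options.
    return "prune"


# Subcommand name -> handler. The table sits after the handlers it references
# and is only looked up when main() runs, not at import time.
DISPATCH = {"up": cmd_up, "prune": cmd_prune}


def main(argv=None):
    p = argparse.ArgumentParser(prog="integration-vm")
    sub = p.add_subparsers(dest="cmd", required=True)
    sub.add_parser("up").add_argument("--host", required=True)
    sub.add_parser("prune")
    args = p.parse_args(argv)
    return DISPATCH[args.cmd](args)


if __name__ == "__main__":
    print(main(["up", "--host", "askari"]))
```

Because `main` resolves `DISPATCH` when it is called, the table can be defined anywhere at module level as long as it exists before the first call.
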
### Task 3: `vm_name`, `free_mib`, `parse_lease_ip` (TDD)

**Files:** Modify `scripts/integration-vm.py`, `tests/test_integration_vm.py`

- [ ] **Step 1: Write failing tests**

```python
def test_vm_name_prefix_and_suffix():
    assert ivm.vm_name("askari", "ab12cd34") == "boma-it-askari-ab12cd34"

def test_vm_name_generates_suffix():
    n = ivm.vm_name("askari")
    assert n.startswith("boma-it-askari-") and len(n.split("-")[-1]) == 8

def test_free_mib_parses_memavailable():
    sample = "MemTotal: 16331156 kB\nMemAvailable: 8388608 kB\n"
    assert ivm.free_mib(sample) == 8192

def test_parse_lease_ip_extracts_ipv4():
    out = (" Name MAC address Protocol Address\n"
           "-------------------------------------------------------------------\n"
           " vnet0 52:54:00:aa:bb:cc ipv4 192.168.150.42/24\n")
    assert ivm.parse_lease_ip(out) == "192.168.150.42"

def test_parse_lease_ip_none_when_absent():
    assert ivm.parse_lease_ip("no leases\n") is None
```

- [ ] **Step 2: Run (fail).** `.venv/bin/pytest tests/test_integration_vm.py -q` → FAIL (no attrs).

- [ ] **Step 3: Implement** (add to `scripts/integration-vm.py`)

```python
def vm_name(host, suffix=None):
    suffix = suffix or uuid.uuid4().hex[:8]
    return f"{NAME_PREFIX}{host}-{suffix}"


def free_mib(meminfo_text):
    # /proc/meminfo reports kB; integer-divide to MiB. 0 when the field is absent.
    m = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(m.group(1)) // 1024 if m else 0


def parse_lease_ip(domifaddr_output):
    # First IPv4 address in `virsh domifaddr` output, without the /prefix.
    m = re.search(r"ipv4\s+(\d+\.\d+\.\d+\.\d+)", domifaddr_output)
    return m.group(1) if m else None
```

- [ ] **Step 4: Run (pass).** `.venv/bin/pytest tests/test_integration_vm.py -q` → PASS.

- [ ] **Step 5: Commit.** `git commit -am "feat(integration-vm): vm naming, RAM guard, lease IP parsing"`

### Task 4: cloud-init `render_meta_data` / `render_user_data` (TDD)

**Files:** Modify driver + tests

- [ ] **Step 1: Write failing tests**

```python
def test_meta_data_has_instance_and_hostname():
    md = ivm.render_meta_data("iid-askari-x", "boma-it-askari-x")
    assert "instance-id: iid-askari-x" in md
    assert "local-hostname: boma-it-askari-x" in md

def test_user_data_injects_key_and_ansible_user():
    ud = ivm.render_user_data("ssh-ed25519 AAAA... claude@ubongo", "ansible")
    assert ud.startswith("#cloud-config")
    assert "name: ansible" in ud
    assert "ssh-ed25519 AAAA... claude@ubongo" in ud
    assert "NOPASSWD:ALL" in ud
```

- [ ] **Step 2: Run (fail).**

- [ ] **Step 3: Implement**

```python
def render_meta_data(instance_id, hostname):
    return f"instance-id: {instance_id}\nlocal-hostname: {hostname}\n"


def render_user_data(ssh_pubkey, ansible_user):
    # The indentation inside these strings is significant: it is the YAML
    # nesting of the cloud-config document.
    return (
        "#cloud-config\n"
        "users:\n"
        f"  - name: {ansible_user}\n"
        "    sudo: 'ALL=(ALL) NOPASSWD:ALL'\n"
        "    shell: /bin/bash\n"
        "    ssh_authorized_keys:\n"
        f"      - {ssh_pubkey}\n"
        "ssh_pwauth: false\n"
        "package_update: false\n"
    )
```

- [ ] **Step 4: Run (pass).**

- [ ] **Step 5: Commit.** `git commit -am "feat(integration-vm): cloud-init user-data/meta-data rendering"`

### Task 5: `cert_file`, `profile_path`, `render_run_hosts` (TDD)

**Files:** Modify driver + tests

- [ ] **Step 1: Write failing tests**

```python
def test_cert_file_valid_tier():
    p = ivm.cert_file("le-staging")
    assert p.name == "le-staging.yml" and p.parent.name == "certs"

def test_cert_file_rejects_bad_tier():
    import pytest
    with pytest.raises(ValueError):
        ivm.cert_file("bogus")

def test_render_run_hosts_single_host_in_groups():
    out = ivm.render_run_hosts("boma-it-askari-x", "192.168.150.42",
                               "ansible", ["offsite_hosts"])
    assert "offsite_hosts:" in out
    assert "boma-it-askari-x:" in out
    assert "ansible_host: 192.168.150.42" in out
    assert "ansible_user: ansible" in out
    # invariant: the real askari host must NOT appear
    assert "askari:" not in out.replace("boma-it-askari-x:", "")
```

- [ ] **Step 2: Run (fail).**

- [ ] **Step 3: Implement**

```python
def cert_file(tier):
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown cert tier: {tier}")
    return CERT_DIR / f"{tier}.yml"


def profile_path(host):
    return PROFILE_DIR / f"{host}.json"


def render_run_hosts(name, ip, ansible_user, groups):
    # Emits a YAML inventory by hand; the leading spaces are the YAML nesting.
    lines = [
        "# Generated by scripts/integration-vm.py (transient, gitignored). Do not edit.",
        "# Single test host ONLY (safety invariant: no real host is ever in scope).",
        "all:",
        "  children:",
    ]
    for g in groups:
        lines += [
            f"    {g}:",
            "      hosts:",
            f"        {name}:",
            f"          ansible_host: {ip}",
            f"          ansible_user: {ansible_user}",
        ]
    return "\n".join(lines) + "\n"
```

- [ ] **Step 4: Run (pass).**

- [ ] **Step 5: Commit.** `git commit -am "feat(integration-vm): cert-tier + profile + transient inventory rendering"`

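To make the shape of the transient inventory concrete, the sketch below renders it for one VM placed in two groups. It is a condensed, self-contained copy of the `render_run_hosts` logic from Task 5 (the header comments are omitted), and the two-space YAML nesting is an assumption of this sketch. `docker_hosts` is used only as an illustrative second group. Note that the test host is repeated under every group, and no other host ever appears.

```python
def render_run_hosts(name, ip, ansible_user, groups):
    """Condensed copy of the Task 5 helper: one host, repeated under each group."""
    lines = ["all:", "  children:"]
    for g in groups:
        lines += [
            f"    {g}:",
            "      hosts:",
            f"        {name}:",
            f"          ansible_host: {ip}",
            f"          ansible_user: {ansible_user}",
        ]
    return "\n".join(lines) + "\n"


inventory = render_run_hosts(
    "boma-it-askari-x", "192.168.150.42", "ansible",
    ["offsite_hosts", "docker_hosts"],  # illustrative group list
)
print(inventory, end="")
```
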
---

## Phase C: Driver orchestration (impure)

### Task 6: `sh` helper + `ensure_image`

**Files:** Modify driver

- [ ] **Step 1: Implement the subprocess helper + image fetch**

```python
def sh(cmd, check=True, capture=False, **kw):
    """Run a command (list form). Logs the command to stderr."""
    print("+ " + " ".join(str(c) for c in cmd), file=sys.stderr)
    return subprocess.run(cmd, check=check,
                          capture_output=capture, text=True, **kw)


def _expected_sha(sha_text, filename):
    for line in sha_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0]
    return None


def ensure_image():
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    img = CACHE_DIR / IMAGE_NAME
    if img.exists():
        return img
    print(f"Downloading {IMAGE_URL} ...", file=sys.stderr)
    # Download to a .part file so a failed fetch never looks like a valid image.
    tmp = img.with_suffix(".part")
    urllib.request.urlretrieve(IMAGE_URL, tmp)
    sha_text = urllib.request.urlopen(SHA_URL).read().decode()
    want = _expected_sha(sha_text, IMAGE_NAME)
    if not want:
        tmp.unlink(missing_ok=True)
        raise SystemExit(f"checksum for {IMAGE_NAME} not found at {SHA_URL}")
    h = hashlib.sha512()
    with open(tmp, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != want:
        tmp.unlink(missing_ok=True)
        raise SystemExit("golden image SHA512 mismatch; refusing to use it")
    tmp.rename(img)
    return img
```

- [ ] **Step 2: Manual verification**

Run: `.venv/bin/python scripts/integration-vm.py prune` once Task 10 adds `prune`; until then, test `ensure_image` directly:
```bash
.venv/bin/python -c "import importlib.util,pathlib; \
s=importlib.util.spec_from_file_location('ivm','scripts/integration-vm.py'); \
m=importlib.util.module_from_spec(s); s.loader.exec_module(m); print(m.ensure_image())"
```
Expected: downloads to `/var/lib/boma-integration/debian-13-genericcloud-amd64.qcow2`, SHA512 verified, prints the path. (Requires Task 1's role applied so the cache dir is group-writable, or run with sudo once.)

- [ ] **Step 3: Commit.** `git commit -am "feat(integration-vm): golden image fetch + SHA512 verification"`

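The verification half of `ensure_image` can be exercised without any network access. The sketch below is an offline illustration under stated assumptions: `demo.qcow2` is a throwaway stand-in for the golden image, the checksum text is built in memory in the two-column `sha512sum` layout, and `expected_sha` / `file_sha512` are local names for this example (the first mirrors `_expected_sha` from Task 6).

```python
import hashlib
import pathlib
import tempfile


def expected_sha(sha_text, filename):
    """Return the hex digest listed for filename in sha512sum-style text, else None."""
    for line in sha_text.splitlines():
        parts = line.split()
        # Accept both "<digest>  <name>" and the binary-mode "<digest> *<name>".
        if len(parts) == 2 and parts[1].lstrip("*") == filename:
            return parts[0]
    return None


def file_sha512(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large image is never fully in memory."""
    h = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


with tempfile.TemporaryDirectory() as d:
    img = pathlib.Path(d) / "demo.qcow2"  # hypothetical stand-in, not a real image
    img.write_bytes(b"not a real image" * 1000)
    good = hashlib.sha512(img.read_bytes()).hexdigest()
    zeros = "0" * 128
    sums = f"{zeros}  other.qcow2\n{good} *demo.qcow2\n"
    want = expected_sha(sums, "demo.qcow2")
    verified = want is not None and file_sha512(img) == want
    missing = expected_sha(sums, "absent.qcow2")
    print("verified:", verified, "| missing entry:", missing)
```

A filename with no entry yields `None`, which is the case `ensure_image` turns into a hard failure rather than silently skipping verification.
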
### Task 7: `net_ensure`, `up` (boot a VM)

**Files:** Modify driver

- [ ] **Step 1: Implement**

```python
def net_ensure():
    r = sh(["virsh", "net-info", NET_NAME], check=False, capture=True)
    if r.returncode != 0:
        xml = RUN_DIR / "net.xml"
        RUN_DIR.mkdir(parents=True, exist_ok=True)
        xml.write_text(NET_XML)
        sh(["virsh", "net-define", str(xml)])
        sh(["virsh", "net-autostart", NET_NAME])
    active = sh(["virsh", "net-info", NET_NAME], capture=True).stdout
    # virsh pads its columns, so match "Active:" followed by any whitespace.
    if not re.search(r"^Active:\s+yes", active, re.MULTILINE):
        sh(["virsh", "net-start", NET_NAME])


def _ssh_pubkey():
    for cand in ("id_ed25519.pub", "id_rsa.pub"):
        p = pathlib.Path.home() / ".ssh" / cand
        if p.exists():
            return p.read_text().strip()
    raise SystemExit("no SSH public key found in ~/.ssh")


def up(host, name=None, mem_mib=DEFAULT_MEM_MIB, vcpus=DEFAULT_VCPUS):
    # Guard rails: enough free RAM, and only one integration VM at a time.
    free = free_mib(pathlib.Path("/proc/meminfo").read_text())
    if free < MIN_FREE_MIB:
        raise SystemExit(f"refusing to start: only {free} MiB free (< {MIN_FREE_MIB})")
    running = sh(["virsh", "list", "--name"], capture=True).stdout.split()
    if any(n.startswith(NAME_PREFIX) for n in running):
        raise SystemExit("an integration VM is already running (one at a time); "
                         "run `integration-vm prune` first")
    name = name or vm_name(host)
    img = ensure_image()
    net_ensure()
    RUN_DIR.mkdir(parents=True, exist_ok=True)
    # Ephemeral copy-on-write overlay; the golden image itself is never written.
    overlay = RUN_DIR / f"{name}.qcow2"
    sh(["qemu-img", "create", "-f", "qcow2", "-F", "qcow2", "-b", str(img), str(overlay)])
    (RUN_DIR / "user-data").write_text(render_user_data(_ssh_pubkey(), "ansible"))
    (RUN_DIR / "meta-data").write_text(render_meta_data(f"iid-{name}", name))
    seed = RUN_DIR / f"{name}-seed.img"
    sh(["cloud-localds", str(seed), str(RUN_DIR / "user-data"), str(RUN_DIR / "meta-data")])
    DIAG_ROOT.mkdir(parents=True, exist_ok=True)
    console = DIAG_ROOT / f"{name}-console.log"
    sh(["virt-install", "--name", name, "--memory", str(mem_mib), "--vcpus", str(vcpus),
        "--import",
        "--disk", f"path={overlay},format=qcow2",
        "--disk", f"path={seed},device=cdrom",
        "--network", f"network={NET_NAME}",
        "--osinfo", "debian13",
        "--graphics", "none",
        "--serial", f"file,path={console}",
        "--noautoconsole"])
    ip = wait_for_ip(name)
    wait_for_ssh(ip, "ansible")
    # State file read back by the later subcommands: name, ip, host (one per line).
    (RUN_DIR / "current").write_text(f"{name}\n{ip}\n{host}\n")
    print(f"VM {name} up at {ip}")
    return name, ip


def wait_for_ip(name, timeout=120):
    end = time.time() + timeout
    while time.time() < end:
        out = sh(["virsh", "domifaddr", name, "--source", "lease"],
                 check=False, capture=True).stdout
        ip = parse_lease_ip(out)
        if ip:
            return ip
        time.sleep(4)
    raise SystemExit(f"timed out waiting for {name} to get a DHCP lease")


def wait_for_ssh(ip, user, timeout=180):
    end = time.time() + timeout
    while time.time() < end:
        r = sh(["ssh", "-o", "StrictHostKeyChecking=no",
                "-o", "UserKnownHostsFile=/dev/null", "-o", "ConnectTimeout=5",
                f"{user}@{ip}", "true"], check=False, capture=True)
        if r.returncode == 0:
            return
        time.sleep(5)
    raise SystemExit(f"timed out waiting for SSH to {ip}")
```

- [ ] **Step 2: Manual smoke (real KVM: requires Task 1 applied to ubongo)**

```bash
.venv/bin/python scripts/integration-vm.py up --host askari  # via DISPATCH once Task 10 lands
```
Expected: golden image present, `boma-it` net active, overlay + seed created, VM boots, prints `VM boma-it-askari-<id> up at 192.168.150.x`. SSH in: `ssh ansible@<ip>` works.

- [ ] **Step 3: Commit.** `git commit -am "feat(integration-vm): network + VM boot (overlay, cloud-init seed, virt-install import)"`

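`wait_for_ip` and `wait_for_ssh` in Task 7 share one shape: call a probe, return on success, sleep, give up at a deadline. The sketch below shows one way that loop could be factored out and unit-tested without real sleeping. It is an illustration, not part of the plan: `poll` and `fake_probe` are hypothetical names, and using `time.monotonic` (which is unaffected by wall-clock adjustments) instead of `time.time` is a choice made for this sketch.

```python
import time


def poll(probe, timeout, interval, clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns a truthy value or `timeout` seconds elapse.

    Returns the probe's value on success and None on timeout. `clock` and
    `sleep` are injectable so tests never have to wait for real time.
    """
    end = clock() + timeout
    while clock() < end:
        result = probe()
        if result:
            return result
        sleep(interval)
    return None


# Demo: a fake probe that only "gets a lease" on its third call.
attempts = []


def fake_probe():
    attempts.append(1)
    return "192.168.150.42" if len(attempts) >= 3 else None


ip = poll(fake_probe, timeout=5, interval=0, sleep=lambda seconds: None)
timed_out = poll(lambda: None, timeout=0, interval=0, sleep=lambda seconds: None)
print(ip, len(attempts), timed_out)
```

In the driver, the caller would still decide what a timeout means, for example by raising `SystemExit` with a message naming the VM.
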
### Task 8: `write_run_inventory`, `apply`

**Files:** Modify driver

- [ ] **Step 1: Implement**

```python
def _read_current():
    txt = (RUN_DIR / "current").read_text().splitlines()
    return txt[0], txt[1], txt[2]  # name, ip, host


def write_run_inventory(name, ip, groups):
    RUN_DIR.mkdir(parents=True, exist_ok=True)
    (RUN_DIR / "hosts.yml").write_text(
        render_run_hosts(name, ip, "ansible", groups))
    # Symlink the production group_vars next to the transient hosts file so the
    # real variables apply without the real inventory ever being in scope.
    link = RUN_DIR / "group_vars"
    target = REPO_ROOT / "inventories" / "production" / "group_vars"
    if link.is_symlink() or link.exists():
        if link.is_symlink():
            link.unlink()
    if not link.exists():
        link.symlink_to(target)


def apply(host, certs):
    name, ip, _ = _read_current()
    prof = json.loads(profile_path(host).read_text())
    write_run_inventory(name, ip, prof["groups"])
    extra = []
    for f in prof.get("extra_vars_files", []):
        extra += ["-e", f"@{INTEG_DIR / f}"]
    extra += ["-e", f"@{cert_file(certs)}"]
    for step in prof["applies"]:
        cmd = [".venv/bin/ansible-playbook", "-i", str(RUN_DIR) + "/",
               f"playbooks/{step['playbook']}", "--limit", name]
        if step.get("tags"):
            cmd += ["--tags", ",".join(step["tags"])]
        cmd += extra
        sh(cmd, cwd=str(REPO_ROOT))
    print(f"applied {host} profile to {name}")
```

- [ ] **Step 2: Manual verification**: deferred to the Task 15 RED run (needs the profile/overrides/cert files from Phase D). Lint passes regardless.

- [ ] **Step 3: Commit.** `git commit -am "feat(integration-vm): transient inventory + real-playbook apply"`

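Since the profile files only arrive in Phase D, it may help to see how a profile turns into `ansible-playbook` invocations. The sketch below mirrors the argv-building loop of `apply` from Task 8 as a pure function. The key names (`groups`, `extra_vars_files`, `applies`, `playbook`, `tags`) are the ones `apply` reads; the profile values, the `base` tag, and the helper name `build_commands` are hypothetical, chosen only for illustration.

```python
import json
import pathlib

# Hypothetical profile content; only the key names come from the driver.
PROFILE_JSON = """
{
  "groups": ["offsite_hosts"],
  "extra_vars_files": ["overrides/askari.yml"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]},
    {"playbook": "offsite.yml"}
  ]
}
"""


def build_commands(profile, run_dir, integ_dir, cert_path, name):
    """Return one ansible-playbook argv per entry in profile["applies"]."""
    extra = []
    for f in profile.get("extra_vars_files", []):
        extra += ["-e", f"@{integ_dir / f}"]
    extra += ["-e", f"@{cert_path}"]  # the cert tier is always appended last
    cmds = []
    for step in profile["applies"]:
        cmd = [".venv/bin/ansible-playbook", "-i", f"{run_dir}/",
               f"playbooks/{step['playbook']}", "--limit", name]
        if step.get("tags"):
            cmd += ["--tags", ",".join(step["tags"])]
        cmds.append(cmd + extra)
    return cmds


integ = pathlib.PurePosixPath("tests/integration")
cmds = build_commands(json.loads(PROFILE_JSON), integ / ".run", integ,
                      integ / "certs" / "internal.yml", "boma-it-askari-x")
for c in cmds:
    print(" ".join(c))
```

Every command is limited to the single VM name and points `-i` at the transient run directory, which is how the harness keeps real hosts out of scope.
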
### Task 9: `reboot_vm`, `run_assert`, `dump_diagnostics`
|
|||
|
|
|
|||
|
|
**Files:** Modify driver
|
|||
|
|
|
|||
|
|
- [ ] **Step 1: Implement**
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
def reboot_vm():
|
|||
|
|
name, ip, _ = _read_current()
|
|||
|
|
sh(["virsh", "reboot", name])
|
|||
|
|
time.sleep(5)
|
|||
|
|
wait_for_ssh(ip, "ansible")
|
|||
|
|
print(f"{name} rebooted, SSH back at {ip}")
|
|||
|
|
|
|||
|
|
|
|||
|
|
def run_assert(host, certs):
|
|||
|
|
name, ip, _ = _read_current()
|
|||
|
|
prof = json.loads(profile_path(host).read_text())
|
|||
|
|
write_run_inventory(name, ip, prof["groups"])
|
|||
|
|
extra = []
|
|||
|
|
for f in prof.get("extra_vars_files", []):
|
|||
|
|
extra += ["-e", f"@{INTEG_DIR / f}"]
|
|||
|
|
extra += ["-e", f"@{cert_file(certs)}"]
|
|||
|
|
cmd = [".venv/bin/ansible-playbook", "-i", str(RUN_DIR) + "/",
|
|||
|
|
"tests/integration/verify.yml", "--limit", name] + extra
|
|||
|
|
r = sh(cmd, cwd=str(REPO_ROOT), check=False)
|
|||
|
|
if r.returncode != 0:
|
|||
|
|
dump_diagnostics(name, ip)
|
|||
|
|
raise SystemExit(f"VERIFY FAILED for {name} — diagnostics in {DIAG_ROOT}")
|
|||
|
|
print(f"VERIFY PASSED for {name}")
|
|||
|
|
|
|||
|
|
|
|||
|
|
def dump_diagnostics(name, ip):
|
|||
|
|
d = DIAG_ROOT / name
|
|||
|
|
d.mkdir(parents=True, exist_ok=True)
|
|||
|
|
for label, cmd in [
|
|||
|
|
("nft", "nft list ruleset"),
|
|||
|
|
("docker", "docker ps -a"),
|
|||
|
|
("ss", "ss -tlnp"),
|
|||
|
|
("journal", "journalctl -b --no-pager"),
|
|||
|
|
("critical-chain", "systemd-analyze critical-chain"),
|
|||
|
|
]:
|
|||
|
|
r = sh(["ssh", "-o", "StrictHostKeyChecking=no",
|
|||
|
|
"-o", "UserKnownHostsFile=/dev/null",
|
|||
|
|
f"ansible@{ip}", "sudo " + cmd], check=False, capture=True)
|
|||
|
|
(d / f"{label}.txt").write_text((r.stdout or "") + (r.stderr or ""))
|
|||
|
|
console = DIAG_ROOT / f"{name}-console.log"
|
|||
|
|
if console.exists():
|
|||
|
|
shutil.copy(console, d / "console.log")
|
|||
|
|
print(f"diagnostics written to {d}", file=sys.stderr)
|
|||
|
|
```

- [ ] **Step 2: Commit.** `git commit -am "feat(integration-vm): reboot, verify run, failure diagnostics"`
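
`reboot_vm` treats a successful SSH connection after a fixed 5-second sleep as proof of a reboot, but a guest that is slow to act on the reboot request could still be answering on its pre-reboot sshd. One way to close that gap, sketched here as an assumption rather than the plan's method, is to compare the kernel's `/proc/sys/kernel/random/boot_id` before and after. `wait_for_new_boot` and its `read_boot_id` callback are hypothetical names; in the driver the callback would run `cat /proc/sys/kernel/random/boot_id` over SSH.

```python
import time


def wait_for_new_boot(read_boot_id, old_id, timeout=180.0, interval=3.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the guest reports a boot_id different from old_id.

    read_boot_id() returns the current boot_id string, or raises OSError
    while the guest is unreachable (expected mid-reboot).
    """
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            new_id = read_boot_id()
        except OSError:
            new_id = None  # SSH down: the guest is probably restarting
        if new_id and new_id != old_id:
            return new_id
        sleep(interval)
    raise TimeoutError(f"guest still on boot_id {old_id} after {timeout}s")


# Simulated guest: same boot, then unreachable, then a fresh boot_id.
_replies = iter(["aaaa", OSError("down"), "bbbb"])


def _fake_read():
    item = next(_replies)
    if isinstance(item, Exception):
        raise item
    return item


demo_new_id = wait_for_new_boot(_fake_read, "aaaa", sleep=lambda s: None)
```

The injectable `sleep` and `clock` arguments exist only so the loop can be unit-tested without waiting.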

### Task 10: `down`, `prune`, `console`, `cycle` + `DISPATCH`

**Files:** Modify driver

- [ ] **Step 1: Implement**

```python
def _destroy(name):
    sh(["virsh", "destroy", name], check=False)
    # --nvram also removes an NVRAM file if one exists; adjust to the guest's actual firmware.
    sh(["virsh", "undefine", name, "--nvram"], check=False)
    for f in RUN_DIR.glob(f"{name}*"):
        f.unlink(missing_ok=True)


def down(host=None, keep=False):
    if keep:
        print("--keep: leaving the VM running for inspection")
        return
    cur = RUN_DIR / "current"
    if cur.exists():
        name = cur.read_text().splitlines()[0]
        _destroy(name)
        cur.unlink(missing_ok=True)
        print(f"destroyed {name}")


def prune():
    # --all includes shut-off domains, not only running ones.
    domains = sh(["virsh", "list", "--all", "--name"], capture=True).stdout.split()
    for n in domains:
        if n.startswith(NAME_PREFIX):
            _destroy(n)
            print(f"pruned {n}")
    (RUN_DIR / "current").unlink(missing_ok=True)


def console():
    name = (RUN_DIR / "current").read_text().splitlines()[0]
    log = DIAG_ROOT / f"{name}-console.log"
    print(log.read_text() if log.exists() else f"no console log at {log}")


def cycle(host, certs, keep=False, no_reboot=False):
    # First draft: `finally` tears down after success AND failure unless --keep.
    # The corrected version later in this step keeps a failed VM for inspection.
    try:
        up(host)
        apply(host, certs)
        if not no_reboot:
            reboot_vm()
        run_assert(host, certs)
    finally:
        if not keep:
            down(host)
```

Wire the dispatch (replace the temporary `DISPATCH = {}`):
```python
DISPATCH = {
    "up": lambda a: (up(a.host), None)[1],  # discard up()'s return value
    "apply": lambda a: apply(a.host, a.certs),
    "reboot": lambda a: reboot_vm(),
    "assert": lambda a: run_assert(a.host, a.certs),
    "down": lambda a: down(a.host, a.keep),
    "console": lambda a: console(),
    "prune": lambda a: prune(),
    "cycle": lambda a: cycle(a.host, a.certs, a.keep, a.no_reboot),
}
```
The first `cycle` tears down in `finally` whether or not the run succeeded. Replace it with this version, which tracks success explicitly: on **failure** the VM is kept (so it can be inspected); on **success** it is destroyed unless `--keep` was given.
```python
def cycle(host, certs, keep=False, no_reboot=False):
    ok = False
    try:
        up(host)
        apply(host, certs)
        if not no_reboot:
            reboot_vm()
        run_assert(host, certs)
        ok = True  # only reached when every stage passed
    finally:
        if ok and not keep:
            down(host)
        elif not ok:
            print("FAILED: VM left up for inspection; `integration-vm prune` to clean.",
                  file=sys.stderr)
```

- [ ] **Step 2: Run unit tests + lint.** `.venv/bin/pytest tests/test_integration_vm.py -q` passes; `make lint` is clean.

- [ ] **Step 3: Commit.** `git commit -am "feat(integration-vm): teardown, prune, console, full cycle + dispatch"`
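
The dispatch table is only useful if every argparse subcommand has an entry and every entry has a subcommand. A minimal sketch of that consistency check follows. The parser built here is a stand-in (the real one is defined in Task 2, outside this section), and `build_parser`, `SUBCOMMANDS`, `missing_handlers`, and the option names and defaults are assumptions modelled on the `a.host`, `a.certs`, `a.keep`, and `a.no_reboot` attributes the lambdas read.

```python
import argparse

SUBCOMMANDS = ["up", "apply", "reboot", "assert", "down", "console", "prune", "cycle"]


def build_parser():
    """Stand-in parser: one subcommand per DISPATCH key (assumed layout)."""
    parser = argparse.ArgumentParser(prog="integration-vm")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in SUBCOMMANDS:
        p = sub.add_parser(name)
        p.add_argument("--host")
        p.add_argument("--certs", default="internal")
        p.add_argument("--keep", action="store_true")
        p.add_argument("--no-reboot", action="store_true")
    return parser


def missing_handlers(parser_commands, dispatch):
    """Return (commands without a handler, handlers without a command)."""
    cmds, keys = set(parser_commands), set(dispatch)
    return sorted(cmds - keys), sorted(keys - cmds)


# Stub dispatch table with the same keys as the plan's DISPATCH.
demo_dispatch = {name: (lambda a: None) for name in SUBCOMMANDS}
demo_args = build_parser().parse_args(["cycle", "--host", "askari", "--no-reboot"])
demo_gaps = missing_handlers(SUBCOMMANDS, demo_dispatch)
```

In the real test module the same comparison would run against the driver's own parser and `DISPATCH`, so adding a subcommand without a handler fails in pytest instead of at runtime.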

---

## Phase D: Profile, cert `internal` tier, verify playbook

### Task 11: reverse_proxy `tls internal` + `acme_ca` knobs

**Files:** Modify `roles/reverse_proxy/defaults/main.yml`, `roles/reverse_proxy/templates/Caddyfile.j2`

- [ ] **Step 1: defaults.** Append:

```yaml
# Integration-test / staging cert knobs (ADR-025). Default off = production behaviour.
reverse_proxy__tls_internal: false   # true => every site uses Caddy's internal CA
reverse_proxy__acme_ca: ""           # set to the LE staging directory URL to use staging
```

- [ ] **Step 2: Caddyfile.j2.** In the global options block (after the `email` line), add:

```jinja
{% if reverse_proxy__acme_ca %}
    acme_ca {{ reverse_proxy__acme_ca }}
{% endif %}
```

In each site block (inside `{{ r['host'] }} {`), add as the first directive:

```jinja
{% if reverse_proxy__tls_internal %}
    tls internal
{% endif %}
```

- [ ] **Step 3: Molecule regression.** Confirm `reverse_proxy` still renders. If the role has a Molecule scenario, run `make test ROLE=reverse_proxy`; otherwise run `make lint`.
Expected: clean. With both knobs at their defaults the `{% if %}` blocks emit nothing, so production output is byte-identical. Keep the `{% if %}` and `{% endif %}` tags at column 0: Ansible's template module trims the newline after a block tag (`trim_blocks`) but does not strip whitespace in front of it, so indented tags would leak spaces into the default-off output.

- [ ] **Step 4: Commit.** `git commit -am "feat(reverse_proxy): tls-internal + acme_ca knobs for integration/staging (ADR-025)"`
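
To make the default-off behaviour concrete, the two template conditionals can be modelled in plain Python. This is only an executable truth table of the knobs, not how the role renders the Caddyfile (that is Jinja via Ansible); `cert_directives` is a hypothetical name.

```python
def cert_directives(tls_internal=False, acme_ca=""):
    """Model the Caddyfile.j2 conditionals from Task 11.

    Returns (global_lines, per_site_lines): the directives each knob
    would add to the global options block and to every site block.
    """
    global_lines = [f"acme_ca {acme_ca}"] if acme_ca else []
    site_lines = ["tls internal"] if tls_internal else []
    return global_lines, site_lines


STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"

production = cert_directives()                 # both knobs at their defaults
internal = cert_directives(tls_internal=True)  # what certs/internal.yml selects
staging = cert_directives(acme_ca=STAGING)     # what certs/le-staging.yml selects
```

With both knobs at their defaults the function returns two empty lists, which is the property the Molecule regression in Step 3 relies on.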

### Task 12: askari profile + overlay + cert-tier files

**Files:** Create `tests/integration/profiles/askari.json`, `tests/integration/overrides/askari.yml`, `tests/integration/certs/{internal,le-staging,le-prod-wildcard}.yml`

- [ ] **Step 1: `profiles/askari.json`**

```json
{
  "groups": ["offsite_hosts"],
  "applies": [
    {"playbook": "site.yml", "tags": ["base"]},
    {"playbook": "offsite.yml", "tags": ["docker_host", "reverse_proxy"]}
  ],
  "extra_vars_files": ["overrides/askari.yml"],
  "mem_mib": 3072,
  "vcpus": 2
}
```

(`netbird_coordinator` is intentionally omitted from v1 `applies`: Caddy's published :443 gives the DNAT that reproduces FRICTION #1. Coordinator fidelity (#3/#4) is a follow-on, Task 21.)

- [ ] **Step 2: `overrides/askari.yml`** (Ansible extra-vars: highest precedence, and the real inventory is never edited)

```yaml
---
# Integration-test overlay for the "askari" profile (ADR-025). Passed via `-e @`.
# Reproduces the 2026-06-17 incident: apply base's nftables default-deny to a Docker host.
base__firewall_apply: true
# Keep a break-glass: sshd stays on all interfaces (never wt0-only in a throwaway VM).
base__ssh_listen_mesh_only: false
# The VM is isolated; it must never touch the real mesh.
base__mesh_enabled: false
```

- [ ] **Step 3: cert-tier files**

`certs/internal.yml`:
```yaml
---
reverse_proxy__tls_internal: true
```
`certs/le-staging.yml`:
```yaml
---
reverse_proxy__tls_internal: false
reverse_proxy__acme_dns_provider: gandi
reverse_proxy__acme_ca: "https://acme-staging-v02.api.letsencrypt.org/directory"
```
`certs/le-prod-wildcard.yml`:
```yaml
---
# On-demand only. Records an accepted risk (ADR-025 / accepted-risks.md): the prod
# Gandi PAT reaches an ephemeral VM and transient TXT records land in the real wingu.me.
reverse_proxy__tls_internal: false
reverse_proxy__acme_dns_provider: gandi
reverse_proxy__acme_ca: ""
```

- [ ] **Step 4: Commit.** `git commit -am "feat(integration): askari profile, stub overlay, cert-tier files"`
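
Because the driver reads the profile with `json.loads` and indexes straight into it, a typo in a profile file only surfaces as a `KeyError` partway through a run. A small up-front check is one option. `validate_profile` below is a hypothetical helper, and the required/optional split is inferred from how `apply` uses the keys (`groups` and `applies` are indexed directly, `extra_vars_files` is read with `.get`).

```python
import json

REQUIRED = {"groups": list, "applies": list}
OPTIONAL = {"extra_vars_files": list, "mem_mib": int, "vcpus": int}


def validate_profile(text):
    """Parse profile JSON and return a list of problems (empty if none)."""
    prof = json.loads(text)
    problems = []
    for key, typ in REQUIRED.items():
        if key not in prof:
            problems.append(f"missing key: {key}")
        elif not isinstance(prof[key], typ):
            problems.append(f"{key} must be {typ.__name__}")
    for key, typ in OPTIONAL.items():
        if key in prof and not isinstance(prof[key], typ):
            problems.append(f"{key} must be {typ.__name__}")
    for key in sorted(set(prof) - set(REQUIRED) - set(OPTIONAL)):
        problems.append(f"unknown key: {key}")
    applies = prof.get("applies")
    if isinstance(applies, list):
        for i, step in enumerate(applies):
            if not isinstance(step, dict) or "playbook" not in step:
                problems.append(f"applies[{i}] has no playbook")
    return problems


good = validate_profile(json.dumps({
    "groups": ["offsite_hosts"],
    "applies": [{"playbook": "site.yml", "tags": ["base"]}],
    "extra_vars_files": ["overrides/askari.yml"],
    "mem_mib": 3072, "vcpus": 2}))
bad = validate_profile(
    '{"groups": "offsite_hosts", "applies": [{"tags": ["base"]}], "memory": 1}')
```

Returning a list instead of raising on the first problem lets one run report every mistake in a new profile.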

### Task 13: verify playbook

**Files:** Create `tests/integration/verify.yml`

- [ ] **Step 1: Write it**

```yaml
---
# Integration verify (ADR-025). Outcome-based: proves Docker forwarding survives the
# reboot. The load-bearing check probes the VM's published :443 FROM the controller
# (ubongo): if base's forward-drop killed DNAT, this times out (the FRICTION #1 bug).
- name: Verify the rebooted host
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Docker daemon is active
      ansible.builtin.command: systemctl is-active docker
      changed_when: false

    - name: Forward chain permits container traffic (drop-in loaded)
      ansible.builtin.command: nft list chain inet filter forward
      register: _fwd
      changed_when: false

    - name: Assert container forwarding is allowed (not pure drop)
      ansible.builtin.assert:
        that: "'accept' in _fwd.stdout"
        fail_msg: >-
          forward chain is pure drop: container forwarding will die on reboot
          (FRICTION 2026-06-17 #1). docker_host container-forward drop-in missing.

    - name: Published HTTPS port answers from the controller (DNAT + forward alive)
      delegate_to: localhost
      become: false
      ansible.builtin.uri:
        url: "https://{{ ansible_host }}/"
        validate_certs: false
        status_code: [200, 308, 404, 502, 503]
        timeout: 10
      register: _probe
      retries: 5
      delay: 6
      until: _probe is succeeded
```

- [ ] **Step 2: Lint.** `make lint` is clean. The file lives under `tests/`, not `playbooks/`, but keep any tags valid; this play uses none, which is fine.

- [ ] **Step 3: Commit.** `git commit -am "feat(integration): outcome-based verify playbook (DNAT-survives-reboot)"`
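
The accepted status list in the :443 probe is deliberately broad: any HTTP response, even a 502 from Caddy with no healthy upstream, shows that DNAT and forwarding delivered the connection, while a dead forward path shows up as a timeout or connection error. The same logic as a stdlib sketch, with `probe_published_port` and its injectable `fetch` callback as hypothetical names. The retry loop only approximates `until`/`retries`/`delay` and makes no claim about Ansible's exact attempt count.

```python
import time

ACCEPTED = {200, 308, 404, 502, 503}  # same list as the uri task


def probe_published_port(fetch, attempts=5, delay=6.0, sleep=time.sleep):
    """Return the first accepted HTTP status, or None if every attempt fails.

    fetch() performs one HTTPS request and returns the status code; it raises
    OSError on timeout or refusal (what a dropped forward path looks like).
    """
    for attempt in range(attempts):
        try:
            status = fetch()
        except OSError:
            status = None
        if status in ACCEPTED:
            return status
        if attempt < attempts - 1:
            sleep(delay)
    return None


def _scripted(replies):
    """Build a fake fetch() that replays the given statuses/exceptions."""
    it = iter(replies)

    def fetch():
        item = next(it)
        if isinstance(item, Exception):
            raise item
        return item
    return fetch


alive = probe_published_port(_scripted([TimeoutError(), 502]), sleep=lambda s: None)
dead = probe_published_port(_scripted([TimeoutError()] * 5), sleep=lambda s: None)
```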

---

## Phase E: Makefile + RED milestone

### Task 14: Makefile targets + .gitignore

**Files:** Modify `Makefile`, `.gitignore`

- [ ] **Step 1: Makefile.** Add after the `test-all` target:

```makefile
test-integration:
ifndef HOST
	$(error HOST is required: make test-integration HOST=<name> [CERTS=internal|le-staging] [KEEP=1])
endif
	PATH="$(CURDIR)/$(VENV)/bin:$$PATH" $(PYTHON) scripts/integration-vm.py cycle \
		--host $(HOST) $(if $(CERTS),--certs $(CERTS)) $(if $(KEEP),--keep)

test-integration-clean:
	PATH="$(CURDIR)/$(VENV)/bin:$$PATH" $(PYTHON) scripts/integration-vm.py prune
```

Add both to `.PHONY` and the `help` block (match the existing style). Recipe lines, including the `$(error ...)` line, must be tab-indented so that the error is raised only when `test-integration` actually runs.

- [ ] **Step 2: .gitignore.** Add:

```
# Integration-test transient run dir (ADR-025); diagnostics live under ~/integration-runs
tests/integration/.run/
```

- [ ] **Step 3: Commit.** `git commit -am "feat(make): test-integration / test-integration-clean targets"`

### Task 15: RED milestone: reproduce the incident

**Files:** none (a validation run); record the outcome.

- [ ] **Step 1: Pre-flight.** Confirm `rbw unlocked` succeeds (the apply decrypts `group_vars/all/vault.yml`) and that Task 1's role is applied to ubongo: `virsh version` works and you are in the `libvirt` group (this may need a re-login).

- [ ] **Step 2: Run the cycle on TODAY's base (no docker_host fix yet)**

Run: `make test-integration HOST=askari`
Expected: VM boots → base (firewall on) + docker_host + reverse_proxy apply → **reboot** → verify **FAILS** at "Assert container forwarding is allowed" and/or the :443 probe times out. Diagnostics appear under `~/integration-runs/boma-it-askari-<id>/` (nft shows `forward { policy drop }` with no accepts; the published port is dead).

- [ ] **Step 3: Confirm the failure is the RIGHT one.** Read `~/integration-runs/<name>/nft.txt`: the `inet filter forward` chain is pure `policy drop`. This is the faithful reproduction of FRICTION #1. **If verify PASSES here, the harness is not faithful: stop and investigate** (e.g. Docker re-added its own accepts, or the firewall didn't apply).

- [ ] **Step 4: Clean up.** `make test-integration-clean`

- [ ] **Step 5: Record.** Append a `[gotcha]`/milestone note to `docs/FRICTION.md` Open signals: "ADR-025 harness reproduced the 2026-06-17 firewall×Docker×reboot bug on a local VM (RED). Diagnostics: nft forward pure-drop, :443 DNAT dead post-reboot." Commit:
```bash
git commit -am "test(integration): RED: harness reproduces the 2026-06-17 incident"
```
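
Reading `nft.txt` by eye works, but the "pure drop" judgement in Step 3 can also be made mechanically. The sketch below pulls the `chain forward` block out of `nft list ruleset` text and reports its policy and accept rules. `forward_chain_summary` is a hypothetical helper, the block tracking is deliberately naive (it assumes one statement per line, as nft normally prints), and the sample ruleset is invented for illustration.

```python
import re


def forward_chain_summary(ruleset, table="inet filter"):
    """Return (policy, accept_rules) for `chain forward` in the given table."""
    in_table = in_chain = False
    policy, accepts = None, []
    for raw in ruleset.splitlines():
        line = raw.strip()
        if line.startswith("table "):
            in_table = line.startswith(f"table {table} ")
            in_chain = False
        elif in_table and line.startswith("chain "):
            in_chain = line.startswith("chain forward ")
        elif in_table and in_chain:
            if line == "}":
                in_chain = False  # end of the forward chain
                continue
            m = re.search(r"policy (\w+);", line)
            if m:
                policy = m.group(1)
            elif re.search(r"\baccept\b", line):
                accepts.append(line)
    return policy, accepts


RED_SAMPLE = """\
table inet filter {
    chain input {
        type filter hook input priority filter; policy drop;
        tcp dport 22 accept
    }
    chain forward {
        type filter hook forward priority filter; policy drop;
    }
}
"""
GREEN_SAMPLE = RED_SAMPLE.replace(
    "hook forward priority filter; policy drop;\n",
    'hook forward priority filter; policy drop;\n        iifname "docker0" accept\n')

red = forward_chain_summary(RED_SAMPLE)
green = forward_chain_summary(GREEN_SAMPLE)
```

A result of `("drop", [])` is the RED signature; accepts in the `input` chain are ignored, which a plain substring search for `accept` over the whole ruleset would not do.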

---

## Phase F: GREEN milestone (docker_host fix)

### Task 16: docker_host container-forward drop-in

**Files:** Modify `roles/docker_host/defaults/main.yml`, `roles/docker_host/tasks/main.yml`; Create `roles/docker_host/templates/10-docker-forward.nft.j2`

- [ ] **Step 1: defaults.** Append:

```yaml
# Container-forward nftables drop-in (FRICTION 2026-06-17 #1 / ADR-025). base's inet
# filter forward chain is `policy drop`; a drop verdict there is final, so Docker's own
# ip-filter accepts can't save forwarded container traffic. We append accepts to base's
# forward chain via base's /etc/nftables.d/*.nft include. Only meaningful on hosts where
# base__firewall_apply is true.
docker_host__forward_dropin: true
```

- [ ] **Step 2: template `templates/10-docker-forward.nft.j2`**

```jinja
# {{ ansible_managed }}
# Allow container forwarding through base's default-deny forward chain (ADR-025).
# nftables matches interface-name prefixes with a trailing "*" (iptables' "+" is not understood).
table inet filter {
    chain forward {
        ct state established,related accept
        iifname "docker0" accept
        oifname "docker0" accept
        iifname "br-*" accept
        oifname "br-*" accept
    }
}
```

- [ ] **Step 3: tasks/main.yml.** Append (after Docker install):

```yaml
- name: Install the container-forward nftables drop-in
  ansible.builtin.template:
    src: 10-docker-forward.nft.j2
    dest: "{{ base__firewall_dropin_dir }}/10-docker-forward.nft"
    mode: "0644"
  when: docker_host__forward_dropin | bool
  notify: reload nftables
  tags: [firewall]
```

Confirm the handler name base exposes:
Run: `grep -rn "listen:\|reload nftables\|nftables" roles/base/handlers/main.yml`
Use base's actual handler `listen:` topic; if none fits, add a `docker_host` handler that runs `nft -f /etc/nftables.conf` (the same reload base uses). Show the handler you add in `roles/docker_host/handlers/main.yml`:
```yaml
---
- name: reload nftables
  ansible.builtin.command: nft -f /etc/nftables.conf
  listen: reload nftables
```

- [ ] **Step 4: GREEN run**

Run: `make test-integration HOST=askari`
Expected: apply (now includes the drop-in) → reboot → verify **PASSES** (forward chain has `accept` rules; :443 answers from ubongo). This is the red→green proof.

If it still fails, read the diagnostics and iterate on the `.nft` rules (e.g. Docker's compose bridges, or a NAT/masquerade gap). **This is exactly what the harness is for.** Keep iterating Step 2 until verify passes.

- [ ] **Step 5: Idempotence + lint + Molecule.** `make lint`; `make test ROLE=docker_host` (add a Molecule assertion that the drop-in file renders if the role has a scenario).

- [ ] **Step 6: Commit.** `git commit -am "fix(docker_host): container-forward nftables drop-in survives reboot (FRICTION #1, ADR-025)"`
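
Docker names its default bridge `docker0` and user-defined bridges (including Compose networks) `br-<id>`, so the drop-in has to cover a name prefix. In nftables, an interface name ending in `*` matches by prefix. As a rough model of which interfaces a pattern list covers (this mimics only the trailing-`*` rule and is not nftables' matcher; the interface names are made up):

```python
def ifname_matches(pattern, ifname):
    """Model nftables ifname matching: a trailing '*' means prefix match."""
    if pattern.endswith("*"):
        return ifname.startswith(pattern[:-1])
    return ifname == pattern


def covered(patterns, ifnames):
    """Map each interface to whether any pattern matches it."""
    return {name: any(ifname_matches(p, name) for p in patterns)
            for name in ifnames}


demo = covered(["docker0", "br-*"],
               ["docker0", "br-3f2a9c1d7e10", "eth0", "wt0"])

# An iptables-style "+" suffix is just a literal character in this model.
iptables_style = ifname_matches("br-+", "br-3f2a9c1d7e10")
```

If the GREEN run still fails on a Compose project, comparing the bridge names in `ip -br link` on the kept VM against the pattern list is a quick first check.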

---

## Phase G: le-staging cert tier

### Task 17: validate `--certs le-staging`

**Files:** none new (exercises Task 11/12); may tweak `overrides/askari.yml` if DNS-01 names need adjusting.

- [ ] **Step 1: Pre-flight.** `rbw unlocked` (the run needs `vault.gandi.pat` for DNS-01). The VM needs outbound egress (the `boma-it` NAT net provides it).

- [ ] **Step 2: Run with the staging cert tier**

Run: `make test-integration HOST=askari CERTS=le-staging`
Expected: same apply, but Caddy now uses DNS-01 against LE **staging** (untrusted root) for the profile's route hostnames (under `wingu.me`, whose DNS lives at Gandi). Verify still passes (the :443 probe uses `validate_certs: false`).

- [ ] **Step 3: Confirm a real staging cert issued.** Run `make test-integration HOST=askari CERTS=le-staging KEEP=1`, then:
```bash
NAME=$(sed -n 1p tests/integration/.run/current)
IP=$(sed -n 2p tests/integration/.run/current)
ssh "ansible@$IP" "sudo docker exec caddy ls /data/caddy/certificates"  # adjust to the caddy data path
```
Expected: a cert dir under an `acme-staging-v02...` issuer path (proves the DNS-01 staging path works end to end). Then `make test-integration-clean`.

- [ ] **Step 4: Commit** (only if `overrides`/`certs` needed tweaks): `git commit -am "test(integration): validate le-staging DNS-01 cert path"`
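
The directory listing from Step 3 can be classified rather than eyeballed. A small sketch, assuming the issuer directories are named after the ACME directory host so that staging entries start with `acme-staging-v02`. `classify_issuers` is hypothetical and the sample listing is invented, so adjust the prefix to what the VM actually shows.

```python
def classify_issuers(listing):
    """Split `ls` output of Caddy's certificates dir into staging vs other."""
    staging, other = [], []
    for entry in listing.split():
        if entry.startswith("acme-staging-v02"):
            staging.append(entry)
        else:
            other.append(entry)
    return staging, other


# Invented sample: one staging issuer directory plus one unrelated entry.
sample_listing = "acme-staging-v02.api.letsencrypt.org-directory\nlocal\n"
staging_dirs, other_dirs = classify_issuers(sample_listing)
```

An empty `staging_dirs` after a `CERTS=le-staging` run would mean issuance never completed, which is worth checking in the Caddy container logs before cleaning up.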

---

## Phase H: Governance & docs

### Task 18: ADR-025

**Files:** Create `docs/decisions/025-local-vm-integration-testing.md`

- [ ] **Step 1: Write the ADR.** Use `docs/decisions/adr-template.md`. Content (no placeholders; write these in full):
  - **Status:** Accepted (2026-06-18).
  - **Context:** Molecule (Level 1) can't catch reboot/firewall/Docker/boot-order bugs; the 2026-06-17 incident; ADR-008 Level 2/3 was deferred for lack of hosts but ubongo can host local KVM (verified `/dev/kvm` + VT-x).
  - **Decision:** libvirt/KVM (Approach A), one throwaway VM at a time from real inventory ("be askari"), stdlib driver over `virsh`, tiered certs (`internal` default, `le-staging` built, `le-prod-wildcard` on-demand), Ansible-managed substrate role, stubs via `-e @` overlays.
  - **Alternatives rejected:** Proxmox-nested (heavy, ADR-015 tension, bugs aren't in provisioning); Vagrant (Ruby/plugin footprint, box drift); terraform-provider-libvirt (poor at imperative reboot loop, blurs ADR-006).
  - **Consequences:** new RAM load on ubongo (resource guard + one-at-a-time); reconciles ADR-015; accepted risk for `le-prod-wildcard`. Cross-reference ADR-008/015/006/024/016/020.

- [ ] **Step 2: Commit.** `git commit -am "docs(adr): ADR-025 local VM integration testing"`

### Task 19: pointers + entries

**Files:** Modify `docs/decisions/008-testing.md`, `docs/decisions/015-control-host.md`, `docs/security/accepted-risks.md`, `CLAUDE.md`, `STATUS.md`, `docs/TODO.md`, `docs/hardware/reference.md`

- [ ] **Step 1: ADR-008.** In the "what Molecule does NOT test" section, add a line: reboot-survivability / host-firewall×Docker / boot-order are now covered by **local VM integration testing (ADR-025)**; add ADR-025 to the Level 2/3 description as its concrete build.

- [ ] **Step 2: ADR-015.** One line: ubongo runs **ephemeral KVM test VMs** as part of its local-test-runner role (ADR-025). It is still not a production hypervisor; note the test-VM RAM load against the 16 GiB sizing.

- [ ] **Step 3: accepted-risks.md.** Add an entry, *le-prod-wildcard integration runs*: the production Gandi PAT (`vault.gandi.pat`) reaches an ephemeral local VM and transient `_acme-challenge` TXT records are written into the real `wingu.me` zone. Scope: on-demand only; staging is the default. Compensating: ephemeral VM, NAT-isolated, TXT auto-removed by Caddy. Owner/date.

- [ ] **Step 4: CLAUDE.md.** Add to the key-commands table:
```
| Integration-test a host on a local VM | `make test-integration HOST=<name> [CERTS=…]` |
| Clean up integration test VMs | `make test-integration-clean` |
```

- [ ] **Step 5: STATUS.md.** Add `roles/integration_test/` + `scripts/integration-vm.py` to "Built + working"; note the RED→GREEN acceptance passed.

- [ ] **Step 6: TODO.md.** Collapse item 2.4 to a one-line pointer: "→ ADR-025 / `make test-integration` (built 2026-06-18)." (Do NOT renumber other items.)

- [ ] **Step 7: hardware/reference.md.** Add a note to ubongo's row/workloads: one integration VM (~3 GiB) at a time; don't run alongside a heavy Level-4 browser session.

- [ ] **Step 8: Commit.** `git commit -am "docs: wire ADR-025 into testing/control-host/risks/status/todo/capacity"`

### Task 20: runbook

**Files:** Create `docs/runbooks/integration-testing.md`

- [ ] **Step 1: Write it.** Sections: when to use it (firewall/sshd/boot/Docker changes, operationalises the standing "test risky infra before live deploy" rule + FRICTION #6 "validate reboot-recovery before retiring break-glass"); commands (`cycle`/`up`/`apply`/`reboot`/`assert`/`down`/`prune`/`console`, `--certs`, `--keep`); where diagnostics land (`~/integration-runs/`); how to inspect a kept failed VM (`virsh console`, ssh); the safety invariants; adding a new profile (a `profiles/<host>.json` + `overrides/<host>.yml`); the cert tiers and when to use each.

- [ ] **Step 2: Add a pre-flight line** to `docs/runbooks/new-host.md` and the hardening runbook: before a lockout-risky change, `make test-integration HOST=<name>` and confirm reboot-recovery while the break-glass is still open.

- [ ] **Step 3: Commit.** `git commit -am "docs(runbook): integration-testing runbook + pre-flight cross-links"`

---

## Deferred (out of v1 scope; track in TODO/FRICTION, not this plan)

- **Task 21 (follow-on): coordinator fidelity.** Add `netbird_coordinator` to the askari profile's `applies` + the geo-DB stub var (needs reading `roles/netbird_coordinator/`), so signals #3 (mesh-bootstrap circularity) and #4 (egress FATAL-loop) reproduce. v1 gate is #1 only.
- **`le-prod-wildcard` issuance/persistence.** Issue `*.test.wingu.me` once, persist on ubongo, mount into the VM. Wired (cert file exists) but unused until needed.
- **Multi-VM mini-staging.** Inter-host mesh/dataplane.
- **Snapshot/`reset`.** Post-apply libvirt snapshot for fast re-runs without re-applying base roles.

---

## Self-Review

**Spec coverage:** Approach A → Tasks 6-10. Substrate role → Task 1. Single-VM "be askari" → Tasks 12/15. Acceptance red→green → Tasks 15/16. Tiered certs (`internal`+`le-staging` built, `le-prod-wildcard` wired) → Tasks 11/12/17. Ansible-managed substrate → Task 1. Stubs in overlay (not inventory) → Task 12 (`-e @`). Safety invariants → Task 5 (single-host inv) + Task 12 (`mesh_enabled: false`) + Task 7 (isolated NAT). Resource guard / one-at-a-time → Task 7. Diagnostics → Task 9. Governance (ADR-025, ADR-008/015 pointers, accepted-risks, CLAUDE.md, runbook, STATUS, TODO, capacity) → Tasks 18-20. **Gap closed:** coordinator (#3/#4) explicitly deferred to Task 21 with the v1 gate stated as #1, which matches the spec's "minimum credible v1 is the red half" scoping.

**Placeholder scan:** none. `_destroy`'s `--nvram` and the caddy data path in Task 17 Step 3 carry "adjust to actual" notes (verification actions, not placeholders). The base nftables handler name is a confirm-then-use step (Task 16 Step 3), not a guess.

**Type/name consistency:** `vm_name/free_mib/parse_lease_ip/render_meta_data/render_user_data/cert_file/profile_path/render_run_hosts` (pure, Tasks 3-5) ↔ used by `up/apply/run_assert` (Tasks 7-9). `RUN_DIR/current` written by `up` (Task 7), read by `_read_current` (Task 8). `DISPATCH` keys ↔ argparse subcommands (Task 2/10). Profile JSON keys (`groups`/`applies`/`extra_vars_files`/`mem_mib`/`vcpus`) ↔ `apply` (Task 8) + `askari.json` (Task 12). Cert files ↔ `cert_file` (Task 5) + Task 12. `base__firewall_dropin_dir` ↔ Task 16 template dest.