- New docs/runbooks/integration-testing.md: when to use (firewall/sshd/boot/Docker changes); make test-integration commands; lower-level driver sub-commands; cert tier guidance; diagnostics dir; VM inspection (virsh console / SSH); safety invariants; resource constraints; adding a new profile; self-validating acceptance test.
- docs/runbooks/new-host.md: pre-flight warning before deploying lockout-risky changes (firewall/sshd/boot) while break-glass is open
- docs/runbooks/new-role.md: step 13 pre-flight for lockout-risky roles

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Runbook — Local VM integration testing
When to use this
Run a local VM integration test before deploying any change that touches:
- nftables / firewall rules (the `firewall` concern of `base`)
- sshd configuration (listener address, port, key types, `base` hardening)
- boot ordering or kernel parameters (systemd units, sysctl)
- Docker host networking (`docker_host` DNAT rules, published-port forwarding, `daemon.json`)
These are the change classes that Molecule (ADR-008 Level 1) cannot catch: they require a real kernel reboot to surface. This harness is the concrete tool for ADR-008 Level 2/3 (see ADR-025) and directly operationalises two standing rules:
- "Test risky infra before live deploy" (standing rule, ubongo memory) — firewall/sshd/boot changes must be tested on a real VM with a real reboot before touching a live host.
- FRICTION 2026-06-17 #6 — validate reboot-recovery before retiring the break-glass — the lesson crystallised from the mesh-hardening incident: confirm the host recovers from reboot while you still have the break-glass open, not after.
You do not need this runbook for pure-config changes (template rendering, package lists, user management) — Molecule covers those.
First-deploy (one-time setup)
The integration_test role installs libvirt + QEMU + virtinst on ubongo and adds the
operator accounts (sjat, claude) to the libvirt and kvm groups.
make deploy PLAYBOOK=site LIMIT=ubongo TAGS=integration_test
Re-login after this run — group membership changes do not take effect in the current
session. The driver (scripts/integration-vm.py) requires both libvirt and kvm
group membership to create and manage VMs.
The golden Debian-13 genericcloud qcow2 image is downloaded lazily on the first run (one-time cost, ~500 MB); subsequent runs reuse the cached image.
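The re-login requirement can be checked before starting a run. The following is a minimal sketch, not the driver's actual code: it compares the groups active in the current session against the two groups named above. The function names are hypothetical.

```python
import grp
import os

# Groups the driver needs, as stated in this runbook.
REQUIRED_GROUPS = ("libvirt", "kvm")


def missing_groups(required, current):
    """Return the required group names that are absent from `current`."""
    have = set(current)
    return [name for name in required if name not in have]


def current_group_names():
    """Names of the groups active in this session (not just listed in /etc/group)."""
    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A gid without a name entry cannot be one of the required groups.
            continue
    return names


if __name__ == "__main__":
    missing = missing_groups(REQUIRED_GROUPS, current_group_names())
    if missing:
        print("missing group membership:", ", ".join(missing), "(re-login after the deploy)")
    else:
        print("libvirt and kvm membership is active in this session")
```

Because `os.getgroups()` reflects the running session, this reports a missing group until the operator has logged in again, even if the deploy already added the account to it.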
Running a cycle
Makefile interface (recommended)
# Full cycle (provision → apply → reboot → assert → teardown on pass)
make test-integration HOST=askari
# With a specific cert tier
make test-integration HOST=askari CERTS=le-staging
# Keep the VM alive after the run (for manual inspection)
make test-integration HOST=askari KEEP=1
# Destroy all orphan integration VMs (name-prefix boma-it-*)
make test-integration-clean
HOST is a hostname from the production inventory (the profile `tests/integration/profiles/<host>.json` must exist; see Adding a new profile below). CERTS defaults
to internal.
Lower-level driver
The driver (scripts/integration-vm.py) exposes individual lifecycle steps for manual
or scripted use:
| Sub-command | What it does |
|---|---|
| `up` | Ensure golden image → create ephemeral overlay → cloud-init seed → boot |
| `apply` | Run the site playbook against the transient inventory (real apply) |
| `reboot` | `virsh reboot` + wait for a verified reboot (boot-id change); the step Molecule cannot do |
| `assert` | Run `tests/integration/verify.yml` (outcome assertions) |
| `cycle` | up → apply → reboot → assert → down (default: destroy on pass) |
| `down` | Destroy the VM + overlay |
| `prune` | Destroy all `boma-it-*` VMs + overlays (orphan cleanup) |
| `console` | Print the VM's captured serial-console log |
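The ordering that `cycle` applies can be sketched as follows. This is an illustration under stated assumptions, not the driver's implementation: each step is a hypothetical callable returning `True` on success, and teardown happens only on a full pass when keeping the VM was not requested, matching the behaviour described in this runbook.

```python
from typing import Callable, Dict, List, Tuple

# Step order taken from the sub-command table; `down` is handled separately.
STEPS = ("up", "apply", "reboot", "assert")


def run_cycle(
    steps: Dict[str, Callable[[], bool]],
    down: Callable[[], None],
    keep: bool = False,
) -> Tuple[bool, List[str]]:
    """Run the steps in order and return (passed, names of the steps that ran)."""
    ran: List[str] = []
    for name in STEPS:
        ran.append(name)
        if not steps[name]():
            # On failure the VM is left running so it can be inspected.
            return False, ran
    if not keep:
        down()
        ran.append("down")
    return True, ran
```

Stopping at the first failed step keeps the VM in the state that produced the failure, which is what makes the diagnostics and the console inspection useful.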
# Example: step through manually
python3 scripts/integration-vm.py up --host askari
python3 scripts/integration-vm.py apply --host askari
python3 scripts/integration-vm.py reboot --host askari
python3 scripts/integration-vm.py assert --host askari
python3 scripts/integration-vm.py down --host askari
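A verified reboot means the boot id has changed, not merely that SSH answers again. The sketch below shows one way such a wait could look. It assumes a hypothetical `read_boot_id` callable that returns the contents of `/proc/sys/kernel/random/boot_id` from inside the VM, or `None` while the VM is unreachable; how the driver actually reads the value is not described here.

```python
import time
from typing import Callable, Optional


def wait_for_new_boot_id(
    read_boot_id: Callable[[], Optional[str]],
    old_id: str,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the VM reports a boot id different from `old_id`."""
    deadline = clock() + timeout
    while True:
        current = read_boot_id()  # None while the VM is down or SSH is not up yet
        if current and current != old_id:
            return current
        if clock() >= deadline:
            raise TimeoutError("boot id did not change; the reboot is not verified")
        sleep(interval)
```

An unchanged boot id after the VM answers again means the reboot never happened, so the assertions that follow would be testing the pre-reboot state.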
Cert tiers
| Tier | Flag | Use when |
|---|---|---|
| `internal` | `CERTS=internal` (default) | Incident repro, firewall/sshd/boot changes where certs are not under test. Zero deps, instant. |
| `le-staging` | `CERTS=le-staging` | Testing the Caddy DNS-01 ACME path, cert renewal logic, or the caddy-gandi plugin. Real cert files, untrusted root, effectively no rate limits. Requires `vault.gandi.pat`. |
| `le-prod-wildcard` | `CERTS=le-prod-wildcard` | Verifying TLS behaviour with a real trusted cert. On-demand only: accepted risk R6 (`docs/security/accepted-risks.md`), because the production Gandi PAT reaches an ephemeral VM and transient TXT records are written into the real wingu.me zone. |
A deliberate "no-egress" scenario (reproducing FRICTION 2026-06-17 #4, the `netbird-server` GeoLite2 FATAL-loop when NAT masquerade is wiped) must use `CERTS=internal`: the egress loss is the fault being simulated, and ACME requires egress.
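The tier rules above reduce to a small validation. The following is a sketch only: the function name and the `egress` flag are hypothetical, and the real driver may enforce this differently or not at all.

```python
# Tier names as listed in the cert-tier table.
CERT_TIERS = ("internal", "le-staging", "le-prod-wildcard")


def check_cert_tier(tier: str = "internal", egress: bool = True) -> str:
    """Validate a CERTS value; `egress=False` models a deliberate no-egress scenario."""
    if tier not in CERT_TIERS:
        raise ValueError(f"unknown cert tier {tier!r}; expected one of {CERT_TIERS}")
    if not egress and tier != "internal":
        # Both Let's Encrypt tiers need outbound access for ACME.
        raise ValueError(f"{tier} needs egress for ACME; use CERTS=internal for no-egress runs")
    return tier
```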
Diagnostics and inspecting a failed VM
Where diagnostics land
Diagnostics from every run are captured in:
~/integration-runs/<timestamp>-<host>/
This directory is gitignored. On a failed assert step, the driver dumps:
- `nft list ruleset`: the live nftables state at failure
- `docker ps -a`: container states
- `ss -tlnp`: listening sockets
- `journalctl -b`: full boot log
- `systemd-analyze critical-chain`: boot timing
- Serial console capture (on boot/SSH failure; the automated equivalent of the Hetzner console, addressing FRICTION 2026-06-17 #5)
The agent reads these directly from ~/integration-runs/ — no manual download needed.
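As an illustration of how such a dump could be assembled, the sketch below writes one file per diagnostic command into a per-run directory. The output file names, the timestamp format, and the injected `run` callable (which would execute a command inside the VM and return its output) are all assumptions; only the command list and the `<timestamp>-<host>` directory shape come from this runbook.

```python
from datetime import datetime
from pathlib import Path
from typing import Callable, List, Optional

# Commands from the list above; the output file names are hypothetical.
DIAGNOSTICS = {
    "nft-ruleset.txt": ["nft", "list", "ruleset"],
    "docker-ps.txt": ["docker", "ps", "-a"],
    "ss-listen.txt": ["ss", "-tlnp"],
    "journal.txt": ["journalctl", "-b"],
    "critical-chain.txt": ["systemd-analyze", "critical-chain"],
}


def run_dir(root: Path, host: str, now: Optional[datetime] = None) -> Path:
    """Build <root>/<timestamp>-<host>; the timestamp format is an assumption."""
    stamp = (now or datetime.now()).strftime("%Y%m%dT%H%M%S")
    return Path(root) / f"{stamp}-{host}"


def collect(run: Callable[[List[str]], str], out_dir: Path) -> List[str]:
    """Run each diagnostic through `run` and write its output to `out_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, argv in DIAGNOSTICS.items():
        (out_dir / filename).write_text(run(argv))
        written.append(filename)
    return written
```

Injecting `run` keeps the collection logic independent of how the VM is reached, for example over SSH to the lease address.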
Inspecting a kept or failed VM
When a run fails or when KEEP=1 is passed, the VM is left running. Connect to it:
# Serial console (no SSH needed — useful when SSH is the fault)
python3 scripts/integration-vm.py console --host askari
# or directly:
virsh console boma-it-askari
# Exit with Ctrl-]
# SSH (as the ansible user, IP from virsh)
virsh domifaddr boma-it-askari --source lease
ssh ansible@<IP>
# List all integration VMs
virsh list --all | grep boma-it-
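For scripted use, the address can be pulled out of the `virsh domifaddr` table rather than copied by hand. This is a minimal sketch with a hypothetical function name; it assumes the usual four-column output (interface name, MAC address, protocol, address with prefix length).

```python
import ipaddress
from typing import List


def parse_domifaddr(output: str) -> List[str]:
    """Extract bare IP addresses from `virsh domifaddr` output."""
    addresses = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have four fields; the header and separator lines do not match.
        if len(fields) == 4 and fields[2] in ("ipv4", "ipv6"):
            addresses.append(str(ipaddress.ip_interface(fields[3]).ip))
    return addresses
```

An empty result means the VM has no lease yet, in which case the serial console is the way in.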
Cleanup
# Destroy a specific VM
python3 scripts/integration-vm.py down --host askari
# Reap all orphans
make test-integration-clean
# or:
python3 scripts/integration-vm.py prune
Safety invariants
These make the test tool itself safe — the harness cannot reach or modify production:
- Single-host transient inventory: the playbook apply runs against a generated single-host inventory (`ansible_host=<VM lease IP>`). No real host is ever in scope.
- In-VM coordinator only: "be askari" points NetBird at the coordinator running inside the VM itself (localhost endpoint). The VM forms its own one-node mesh; it never enrols in the real NetBird mesh.
- Isolated NAT network: test VMs sit on a dedicated libvirt NAT network. Outbound NAT provides ACME/image-pull access, but the VM is not reachable from the LAN (`10.20.x`) or the real mesh.
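The first invariant can be made concrete with a sketch of how a single-host inventory line might be generated. Everything here is illustrative: the function name is hypothetical, and the CIDR is a placeholder because this runbook does not state the address range of the test NAT network. Refusing addresses outside that range is one way to keep a real host from ever entering the inventory.

```python
import ipaddress

# Hypothetical CIDR for the dedicated libvirt NAT network (not stated in this runbook).
TEST_NAT_NETWORK = ipaddress.ip_network("192.168.150.0/24")


def transient_inventory(host, lease_ip, network=TEST_NAT_NETWORK, user="ansible"):
    """Return a one-line INI inventory that points `host` at the VM lease address."""
    ip = ipaddress.ip_address(lease_ip)  # raises ValueError on malformed input
    if ip not in network:
        # Anything outside the test network could be a real host: refuse it.
        raise ValueError(f"{ip} is outside the test network {network}")
    return f"{host} ansible_host={ip} ansible_user={user}\n"
```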
Resource constraints
The default VM profile is ~2 vCPU / 3 GiB RAM / 20 GiB thin-provisioned overlay. The
driver enforces one integration VM at a time (refusing to start if another
boma-it-* VM is already running) and refuses to start below the free-RAM threshold
(~13 GiB available on ubongo at baseline, per ADR-025).
Do not run a test-integration cycle alongside a Level-4 browser session (Chromium/Playwright, ADR-017) — both compete for ubongo RAM. The resource guard is the enforcement mechanism, not a suggestion.
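The two checks the resource guard performs can be sketched as below. This is not the driver's code: the function names are hypothetical, the threshold is left as a parameter because its exact value is defined in ADR-025 rather than here, and the inputs are assumed to be the text of `/proc/meminfo` and a list of running libvirt domain names.

```python
from typing import Iterable, Optional


def mem_available_gib(meminfo_text: str) -> float:
    """Read MemAvailable (reported in kB) from /proc/meminfo text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemAvailable not found in meminfo text")


def guard(
    meminfo_text: str,
    running_domains: Iterable[str],
    min_free_gib: float,
    prefix: str = "boma-it-",
) -> Optional[str]:
    """Return a refusal reason, or None when a new VM may start."""
    busy = [name for name in running_domains if name.startswith(prefix)]
    if busy:
        return f"refusing: {busy[0]} is already running"
    free = mem_available_gib(meminfo_text)
    if free < min_free_gib:
        return f"refusing: {free:.1f} GiB available, need {min_free_gib:.1f} GiB"
    return None
```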
Adding a new profile
To make the harness "be" a different host:
- Create `tests/integration/profiles/<hostname>.json`: specifies which roles to apply and base VM sizing for that host.
- Create `tests/integration/overrides/<hostname>.yml`: the explicit stub overlay with the cert tier, in-VM coordinator endpoint (if the host runs the coordinator), `ansible_host` placeholder, and any other variables that must differ from the real inventory (e.g. public DNS → local resolution, geo-DB disable for coordinator).
- Add assertions to `tests/integration/verify.yml` (or extend an existing task with a `when: inventory_hostname == '<hostname>'` guard) for any host-specific outcomes.
- Run `make test-integration HOST=<hostname>` to validate the new profile.
All stubs must be explicit in the overlay — the real inventory is never edited.
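Before the first run of a new profile, it can help to confirm that both per-host files are in place. The helper below is a hypothetical pre-flight sketch, not part of the driver; it only checks for the two paths named in the steps above.

```python
from pathlib import Path
from typing import List


def missing_profile_files(repo_root, hostname: str) -> List[str]:
    """List the per-host integration files that do not exist yet."""
    root = Path(repo_root)
    base = root / "tests" / "integration"
    required = [
        base / "profiles" / f"{hostname}.json",
        base / "overrides" / f"{hostname}.yml",
    ]
    return [p.relative_to(root).as_posix() for p in required if not p.is_file()]
```

An empty list means both files exist; it says nothing about whether their contents are valid, which is what the actual test run establishes.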
Reproducing the 2026-06-17 incident
The acceptance test for the harness (ADR-025) deliberately reproduces the incident:
- Run with today's `base` (firewall on, no `docker_host` container-forward drop-in): `make test-integration HOST=askari CERTS=internal`. The assert step must FAIL after reboot (Docker forwarding dead, published ports unreachable). If it passes, the harness is not faithful.
- Implement the `docker_host` container-forward rules (FRICTION 2026-06-17 #1 fix) and re-run. The assert step must PASS across the reboot.
This round-trip proves: (a) the harness faithfully reproduces the incident, and (b) the fix survives a real reboot.
Related
- ADR-025 — decision record for this harness (approach, cert tiers, safety invariants)
- ADR-008 — testing methodology; this is Level 2/3
- `docs/security/accepted-risks.md` R6: `le-prod-wildcard` accepted risk
- `docs/FRICTION.md`: 2026-06-17 signals that motivated this runbook