sjat f51ae1a13d docs(runbook): integration-testing runbook + pre-flight cross-links

- New docs/runbooks/integration-testing.md: when to use (firewall/
  sshd/boot/Docker changes); make test-integration commands; lower-
  level driver sub-commands; cert tier guidance; diagnostics dir;
  VM inspection (virsh console / SSH); safety invariants; resource
  constraints; adding a new profile; self-validating acceptance test.
- docs/runbooks/new-host.md: pre-flight warning before deploying
  lockout-risky changes (firewall/sshd/boot) while break-glass is open
- docs/runbooks/new-role.md: step 13 pre-flight for lockout-risky roles

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-18 12:59:06 +02:00


Runbook — Local VM integration testing

When to use this

Run a local VM integration test before deploying any change that touches:

nftables / firewall rules (the firewall concern of base)
sshd configuration (listener address, port, key types, base hardening)
boot ordering or kernel parameters (systemd units, sysctl)
Docker host networking (docker_host DNAT rules, published-port forwarding, daemon.json)

These are the change classes that Molecule (ADR-008 Level 1) cannot catch: they require a real kernel reboot to surface. This harness is the concrete tool for ADR-008 Level 2/3 (see ADR-025) and directly operationalises two standing rules:

"Test risky infra before live deploy" (standing rule, ubongo memory) — firewall/sshd/boot changes must be tested on a real VM with a real reboot before touching a live host.
FRICTION 2026-06-17 #6, "validate reboot-recovery before retiring the break-glass": the lesson from the mesh-hardening incident is to confirm that the host recovers from a reboot while the break-glass is still open, not after.

You do not need this runbook for pure-config changes (template rendering, package lists, user management) — Molecule covers those.

First-deploy (one-time setup)

The integration_test role installs libvirt + QEMU + virtinst on ubongo and adds the operator accounts (sjat, claude) to the libvirt and kvm groups.

make deploy PLAYBOOK=site LIMIT=ubongo TAGS=integration_test

Re-login after this run — group membership changes do not take effect in the current session. The driver (scripts/integration-vm.py) requires both libvirt and kvm group membership to create and manage VMs.

The golden Debian-13 genericcloud qcow2 image is downloaded lazily on the first run (one-time cost, ~500 MB); subsequent runs reuse the cached image.

Running a cycle

Makefile interface (recommended)

# Full cycle (provision → apply → reboot → assert → teardown on pass)
make test-integration HOST=askari

# With a specific cert tier
make test-integration HOST=askari CERTS=le-staging

# Keep the VM alive after the run (for manual inspection)
make test-integration HOST=askari KEEP=1

# Destroy all orphan integration VMs (name-prefix boma-it-*)
make test-integration-clean

HOST is a hostname from the production inventory; the profile tests/integration/profiles/<host>.json must exist (see Adding a new profile below). CERTS defaults to internal.

Lower-level driver

The driver (scripts/integration-vm.py) exposes individual lifecycle steps for manual or scripted use:

| Sub-command | What it does |
| --- | --- |
| `up` | Ensure golden image → create ephemeral overlay → cloud-init seed → boot |
| `apply` | Run the site playbook against the transient inventory (real apply) |
| `reboot` | `virsh reboot` + wait for a verified reboot (boot-id change); the step Molecule cannot do |
| `assert` | Run `tests/integration/verify.yml` (outcome assertions) |
| `cycle` | `up` → `apply` → `reboot` → `assert` → `down` (default: destroy on pass) |
| `down` | Destroy the VM + overlay |
| `prune` | Destroy all `boma-it-*` VMs + overlays (orphan cleanup) |
| `console` | Print the VM's captured serial-console log |
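The ordering and teardown behaviour of `cycle` can be sketched as below. This is a minimal illustration under stated assumptions, not the driver's actual code: `run_cycle` and the `actions` mapping are hypothetical names, and each step is modelled as a callable that returns True on success.

```python
from typing import Callable, Dict, List

# Step order taken from the table above; "down" is handled separately.
STEPS = ["up", "apply", "reboot", "assert"]


def run_cycle(actions: Dict[str, Callable[[], bool]], keep: bool = False) -> List[str]:
    """Run the steps in order and return the names of the steps executed.

    The VM is destroyed only when every step passed and keep is False,
    so a failed run (or KEEP=1) leaves the VM available for inspection.
    """
    executed: List[str] = []
    passed = True
    for name in STEPS:
        executed.append(name)
        if not actions[name]():
            passed = False
            break  # stop at the first failing step
    if passed and not keep:
        actions["down"]()
        executed.append("down")
    return executed
```

Stopping at the first failure keeps the VM in the state that produced the failure, which is what makes the later inspection steps useful.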

# Example: step through manually
python3 scripts/integration-vm.py up --host askari
python3 scripts/integration-vm.py apply --host askari
python3 scripts/integration-vm.py reboot --host askari
python3 scripts/integration-vm.py assert --host askari
python3 scripts/integration-vm.py down --host askari
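A "verified reboot" means waiting for evidence that the kernel really restarted, not just for SSH to answer again. One common way to do that on Linux is to compare the boot ID before and after. The sketch below assumes a hypothetical `read_boot_id` callable that fetches the file from the guest (for example over SSH); how the driver actually reads it is not documented here.

```python
import time
from typing import Callable

# Real Linux path: the kernel writes a fresh random UUID here on every boot.
BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def wait_for_new_boot(
    read_boot_id: Callable[[], str],
    before: str,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the guest reports a boot ID different from `before`.

    `read_boot_id` (hypothetical) returns the contents of BOOT_ID_PATH on
    the guest and raises OSError while the guest is unreachable.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            current = read_boot_id().strip()
        except OSError:
            current = ""  # guest is down or SSH is not up yet; keep polling
        if current and current != before:
            return current  # a changed boot ID proves a real restart
        sleep(interval)
    raise TimeoutError(f"boot ID still {before!r} after {timeout:.0f}s")
```

Capture `before` prior to issuing `virsh reboot`. A guest that never went down still reports the old ID, so the check cannot pass by accident.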

Cert tiers

| Tier | Flag | Use when |
| --- | --- | --- |
| `internal` | `CERTS=internal` (default) | Incident repro, firewall/sshd/boot changes where certs are not under test. Zero deps, instant. |
| `le-staging` | `CERTS=le-staging` | Testing the Caddy DNS-01 ACME path, cert renewal logic, or the `caddy-gandi` plugin. Real cert files, untrusted root, effectively no rate limits. Requires `vault.gandi.pat`. |
| `le-prod-wildcard` | `CERTS=le-prod-wildcard` | Verifying TLS behaviour with a real trusted cert. On-demand only: accepted risk R6 (`docs/security/accepted-risks.md`). The production Gandi PAT reaches an ephemeral VM and transient TXT records are written into the real `wingu.me` zone. |

A deliberate "no-egress" scenario (reproducing FRICTION 2026-06-17 #4 — the netbird-server GeoLite2 FATAL-loop when NAT masquerade is wiped) must use CERTS=internal: the egress loss is the fault being simulated, and ACME requires egress.
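The tier rules above can be expressed as a small pre-flight check. This is a sketch only: `check_cert_tier` and its `no_egress` flag are hypothetical and are not part of the documented Makefile or driver interface.

```python
# Tier names as listed in the cert tier table.
CERT_TIERS = ("internal", "le-staging", "le-prod-wildcard")


def check_cert_tier(tier: str = "internal", no_egress: bool = False) -> str:
    """Return the tier if the combination is usable, else raise ValueError."""
    if tier not in CERT_TIERS:
        raise ValueError(f"unknown cert tier {tier!r}; expected one of {CERT_TIERS}")
    if no_egress and tier != "internal":
        # ACME needs outbound access, which a no-egress scenario removes.
        raise ValueError(f"{tier} needs egress; use internal for no-egress runs")
    return tier
```

Rejecting the combination up front avoids a run that fails on certificate issuance instead of on the fault being simulated.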

Diagnostics and inspecting a failed VM

Where diagnostics land

Diagnostics from every run are captured in:

~/integration-runs/<timestamp>-<host>/

This directory is gitignored. On a failed assert step, the driver dumps:

nft list ruleset — the live nftables state at failure
docker ps -a — container states
ss -tlnp — listening sockets
journalctl -b — full boot log
systemd-analyze critical-chain — boot timing
Serial console capture (on boot/SSH failure — the automated equivalent of the Hetzner console, addressing FRICTION 2026-06-17 #5)

The agent reads these directly from ~/integration-runs/ — no manual download needed.
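To find the most recent run for a host, something like the following would work. It is a sketch that assumes only the `<timestamp>-<host>` naming shown above. The timestamp format is not specified here, so the helper (`latest_run_dir`, a hypothetical name) orders directories by modification time and does not parse the name.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(host: str, root: Path = Path.home() / "integration-runs") -> Optional[Path]:
    """Return the newest <timestamp>-<host> directory under root, or None."""
    if not root.is_dir():
        return None
    # Match on the "-<host>" suffix; the timestamp part is left unparsed.
    candidates = [p for p in root.glob(f"*-{host}") if p.is_dir()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

For example, `latest_run_dir("askari")` returns the directory to read after a failed `make test-integration HOST=askari`, or None if no run has been recorded yet.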

Inspecting a kept or failed VM

When a run fails or when KEEP=1 is passed, the VM is left running. Connect to it:

# Serial console (no SSH needed — useful when SSH is the fault)
python3 scripts/integration-vm.py console --host askari
# or directly:
virsh console boma-it-askari
# Exit with Ctrl-]

# SSH (as the ansible user, IP from virsh)
virsh domifaddr boma-it-askari --source lease
ssh ansible@<IP>

# List all integration VMs
virsh list --all | grep boma-it-

Cleanup

# Destroy a specific VM
python3 scripts/integration-vm.py down --host askari

# Reap all orphans
make test-integration-clean
# or:
python3 scripts/integration-vm.py prune

Safety invariants

These make the test tool itself safe — the harness cannot reach or modify production:

Single-host transient inventory: the playbook apply runs against a generated single-host inventory (ansible_host=<VM lease IP>). No real host is ever in scope.
In-VM coordinator only: when the harness is told to "be askari", NetBird is pointed at the coordinator running inside the VM itself (localhost endpoint). The VM forms its own one-node mesh; it never enrols in the real NetBird mesh.
Isolated NAT network: test VMs sit on a dedicated libvirt NAT network. Outbound NAT provides ACME/image-pull access, but the VM is not reachable from the LAN (10.20.x) or the real mesh.
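The first invariant is easy to picture as code. The sketch below renders a one-line INI inventory for a single host. `render_transient_inventory` is a hypothetical helper, and the driver's real inventory format and variables may differ. Only `ansible_host` and the `ansible` user come from this runbook.

```python
import ipaddress


def render_transient_inventory(host: str, lease_ip: str, user: str = "ansible") -> str:
    """Render an INI inventory containing exactly one host line."""
    ipaddress.ip_address(lease_ip)  # raises ValueError on a malformed address
    if not host or any(c.isspace() for c in host):
        raise ValueError(f"invalid inventory hostname {host!r}")
    # One ungrouped host line: nothing else can be targeted by the apply.
    return f"{host} ansible_host={lease_ip} ansible_user={user}\n"
```

Because the file is generated from scratch for each run and never derived by editing the production inventory, a typo in `--limit` or a stray group cannot pull a real host into scope.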

Resource constraints

The default VM profile is ~2 vCPU / 3 GiB RAM / 20 GiB thin-provisioned overlay. The driver enforces one integration VM at a time (refusing to start if another boma-it-* VM is already running) and refuses to start below the free-RAM threshold (~13 GiB available on ubongo at baseline, per ADR-025).

Do not run a test-integration cycle alongside a Level-4 browser session (Chromium/Playwright, ADR-017) — both compete for ubongo RAM. The resource guard is the enforcement mechanism, not a suggestion.
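A guard of this kind can be sketched as follows. The function names are hypothetical, and the threshold is left as a parameter because the exact value the driver enforces is not stated here. `MemAvailable` is the standard `/proc/meminfo` field, reported in kB.

```python
from typing import Iterable

VM_PREFIX = "boma-it-"  # integration VM name prefix used in this runbook


def mem_available_gib(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo text and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemAvailable not found in meminfo text")


def check_resources(running_vms: Iterable[str], meminfo_text: str, min_free_gib: float) -> None:
    """Raise RuntimeError if another integration VM runs or RAM is too low."""
    busy = sorted(n for n in running_vms if n.startswith(VM_PREFIX))
    if busy:
        raise RuntimeError(f"integration VM already running: {', '.join(busy)}")
    free = mem_available_gib(meminfo_text)
    if free < min_free_gib:
        raise RuntimeError(f"only {free:.1f} GiB available, need {min_free_gib:.1f} GiB")
```

In practice `running_vms` could come from the output of `virsh list --name` and `meminfo_text` from reading `/proc/meminfo`. Passing both in as plain data keeps the check testable without libvirt.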

Adding a new profile

To make the harness "be" a different host:

Create tests/integration/profiles/<hostname>.json — specifies which roles to apply and base VM sizing for that host.
Create tests/integration/overrides/<hostname>.yml — the explicit stub overlay: cert tier, in-VM coordinator endpoint (if the host runs the coordinator), ansible_host placeholder, and any other variables that must differ from the real inventory (e.g. public DNS → local resolution, geo-DB disable for coordinator).
Add assertions to tests/integration/verify.yml (or extend an existing task with a when: inventory_hostname == '<hostname>' guard) for any host-specific outcomes.
Run make test-integration HOST=<hostname> to validate the new profile.

All stubs must be explicit in the overlay — the real inventory is never edited.
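Before the first run of a new profile, a quick check that both files exist can save a provisioning cycle. The helper below is a hypothetical sketch that uses only the two paths named above. It checks that the profile is parseable JSON but assumes nothing about its schema, which this runbook does not specify.

```python
import json
from pathlib import Path
from typing import List


def profile_problems(host: str, repo_root: Path) -> List[str]:
    """Return a list of problems with the profile and override files for host."""
    base = repo_root / "tests" / "integration"
    profile = base / "profiles" / f"{host}.json"
    override = base / "overrides" / f"{host}.yml"
    problems: List[str] = []
    for path in (profile, override):
        if not path.is_file():
            problems.append(f"missing: {path.relative_to(repo_root)}")
    if profile.is_file():
        try:
            json.loads(profile.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"invalid JSON in {profile.name}: {exc.msg}")
    return problems
```

An empty list means both files are in place. Whether their contents are correct is still decided by the actual `make test-integration HOST=<hostname>` run.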

Reproducing the 2026-06-17 incident

The acceptance test for the harness (ADR-025) deliberately reproduces the incident:

Run with today's base (firewall on, no docker_host container-forward drop-in):
```
make test-integration HOST=askari CERTS=internal
```
The assert step must FAIL after reboot (Docker forwarding dead, published ports unreachable). If it passes, the harness is not faithful.
Implement the docker_host container-forward rules (FRICTION 2026-06-17 #1 fix) and re-run. The assert step must PASS across the reboot.

This round-trip proves: (a) the harness faithfully reproduces the incident, and (b) the fix survives a real reboot.

ADR-025 — decision record for this harness (approach, cert tiers, safety invariants)
ADR-008 — testing methodology; this is Level 2/3
docs/security/accepted-risks.md R6 — le-prod-wildcard accepted risk
docs/FRICTION.md — 2026-06-17 signals that motivated this runbook
