From f51ae1a13d87e8ca6bfc7caf1c473daca976237e Mon Sep 17 00:00:00 2001
From: sjat
Date: Thu, 18 Jun 2026 12:52:53 +0200
Subject: [PATCH] docs(runbook): integration-testing runbook + pre-flight cross-links

- New docs/runbooks/integration-testing.md: when to use
  (firewall/sshd/boot/Docker changes); make test-integration commands;
  lower-level driver sub-commands; cert tier guidance; diagnostics dir;
  VM inspection (virsh console / SSH); safety invariants; resource
  constraints; adding a new profile; self-validating acceptance test.
- docs/runbooks/new-host.md: pre-flight warning before deploying
  lockout-risky changes (firewall/sshd/boot) while break-glass is open
- docs/runbooks/new-role.md: step 13 pre-flight for lockout-risky roles

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/runbooks/integration-testing.md | 229 +++++++++++++++++++++++++++
 docs/runbooks/new-host.md            |   7 +
 docs/runbooks/new-role.md            |  15 +-
 3 files changed, 250 insertions(+), 1 deletion(-)
 create mode 100644 docs/runbooks/integration-testing.md

diff --git a/docs/runbooks/integration-testing.md b/docs/runbooks/integration-testing.md
new file mode 100644
index 0000000..38a2f09
--- /dev/null
+++ b/docs/runbooks/integration-testing.md
@@ -0,0 +1,229 @@
+# Runbook — Local VM integration testing
+
+## When to use this
+
+Run a local VM integration test before deploying any change that touches:
+
+- **nftables / firewall rules** (the `firewall` concern of `base`)
+- **sshd configuration** (listener address, port, key types, `base` hardening)
+- **boot ordering or kernel parameters** (systemd units, sysctl)
+- **Docker host networking** (`docker_host` DNAT rules, published-port forwarding, `daemon.json`)
+
+These are the change classes that Molecule (ADR-008 Level 1) cannot catch: they require
+a real kernel reboot to surface. This harness is the concrete tool for ADR-008 Level 2/3
+(see ADR-025) and directly operationalises two standing rules:
+
+- **"Test risky infra before live deploy"** (standing rule, ubongo memory) — firewall/sshd/boot changes must be tested on a real VM with a real reboot before touching a live host.
+- **FRICTION 2026-06-17 #6 — validate reboot-recovery before retiring the break-glass** — the lesson crystallised from the mesh-hardening incident: confirm the host recovers from reboot *while you still have the break-glass open*, not after.
+
+You do not need this runbook for pure-config changes (template rendering, package lists, user management) — Molecule covers those.
+
+---
+
+## First-deploy (one-time setup)
+
+The `integration_test` role installs libvirt + QEMU + virtinst on ubongo and adds the
+operator accounts (`sjat`, `claude`) to the `libvirt` and `kvm` groups.
+
+```bash
+make deploy PLAYBOOK=site LIMIT=ubongo TAGS=integration_test
+```
+
+**Re-login after this run** — group membership changes do not take effect in the current
+session. The driver (`scripts/integration-vm.py`) requires both `libvirt` and `kvm`
+group membership to create and manage VMs.
+
+The golden Debian-13 genericcloud qcow2 image is downloaded lazily on the first run
+(one-time cost, ~500 MB); subsequent runs reuse the cached image.
+
+---
+
+## Running a cycle
+
+### Makefile interface (recommended)
+
+```bash
+# Full cycle (provision → apply → reboot → assert → teardown on pass)
+make test-integration HOST=askari
+
+# With a specific cert tier
+make test-integration HOST=askari CERTS=le-staging
+
+# Keep the VM alive after the run (for manual inspection)
+make test-integration HOST=askari KEEP=1
+
+# Destroy all orphan integration VMs (name-prefix boma-it-*)
+make test-integration-clean
+```
+
+`HOST` is a hostname from the production inventory (the profile `tests/integration/
+profiles/.json` must exist — see Adding a new profile below). `CERTS` defaults
+to `internal`.
+
+### Lower-level driver
+
+The driver (`scripts/integration-vm.py`) exposes individual lifecycle steps for manual
+or scripted use:
+
+| Sub-command | What it does |
+|---|---|
+| `up` | Ensure golden image → create ephemeral overlay → cloud-init seed → boot |
+| `apply` | Run the site playbook against the transient inventory (real apply) |
+| `reboot` | `virsh reboot` + wait for a verified reboot (boot-id change) — the step Molecule cannot do |
+| `assert` | Run `tests/integration/verify.yml` (outcome assertions) |
+| `cycle` | `up` → `apply` → `reboot` → `assert` → `down` (default: destroy on pass) |
+| `down` | Destroy the VM + overlay |
+| `prune` | Destroy all `boma-it-*` VMs + overlays (orphan cleanup) |
+| `console` | Print the VM's captured serial-console log |
+
+```bash
+# Example: step through manually
+python3 scripts/integration-vm.py up --host askari
+python3 scripts/integration-vm.py apply --host askari
+python3 scripts/integration-vm.py reboot --host askari
+python3 scripts/integration-vm.py assert --host askari
+python3 scripts/integration-vm.py down --host askari
+```
+
+---
+
+## Cert tiers
+
+| Tier | Flag | Use when |
+|---|---|---|
+| `internal` | `CERTS=internal` (default) | Incident repro, firewall/sshd/boot changes where certs are not under test. Zero deps, instant. |
+| `le-staging` | `CERTS=le-staging` | Testing the Caddy DNS-01 ACME path, cert renewal logic, or the `caddy-gandi` plugin. Real cert files, untrusted root, effectively no rate limits. Requires `vault.gandi.pat`. |
+| `le-prod-wildcard` | `CERTS=le-prod-wildcard` | Verifying TLS behaviour with a real trusted cert. On-demand only — accepted risk R6 (`docs/security/accepted-risks.md`): the production Gandi PAT reaches an ephemeral VM and transient TXT records are written into the real `wingu.me` zone. |
+
+> A deliberate "no-egress" scenario (reproducing FRICTION 2026-06-17 #4 — the
+> `netbird-server` GeoLite2 FATAL-loop when NAT masquerade is wiped) **must** use
+> `CERTS=internal`: the egress loss is the fault being simulated, and ACME requires egress.
+
+---
+
+## Diagnostics and inspecting a failed VM
+
+### Where diagnostics land
+
+Diagnostics from every run are captured in:
+
+```
+~/integration-runs/-/
+```
+
+This directory is gitignored. On a failed assert step, the driver dumps:
+
+- `nft list ruleset` — the live nftables state at failure
+- `docker ps -a` — container states
+- `ss -tlnp` — listening sockets
+- `journalctl -b` — full boot log
+- `systemd-analyze critical-chain` — boot timing
+- Serial console capture (on boot/SSH failure — the automated equivalent of the Hetzner
+  console, addressing FRICTION 2026-06-17 #5)
+
+The agent reads these directly from `~/integration-runs/` — no manual download needed.
+
+### Inspecting a kept or failed VM
+
+When a run fails or when `KEEP=1` is passed, the VM is left running. Connect to it:
+
+```bash
+# Serial console (no SSH needed — useful when SSH is the fault)
+python3 scripts/integration-vm.py console --host askari
+# or directly:
+virsh console boma-it-askari
+# Exit with Ctrl-]
+
+# SSH (as the ansible user, IP from virsh)
+virsh domifaddr boma-it-askari --source lease
+ssh ansible@
+
+# List all integration VMs
+virsh list --all | grep boma-it-
+```
+
+### Cleanup
+
+```bash
+# Destroy a specific VM
+python3 scripts/integration-vm.py down --host askari
+
+# Reap all orphans
+make test-integration-clean
+# or:
+python3 scripts/integration-vm.py prune
+```
+
+---
+
+## Safety invariants
+
+These make the test tool itself safe — the harness cannot reach or modify production:
+
+1. **Single-host transient inventory** — the playbook apply runs against a generated
+   single-host inventory (`ansible_host=`). No real host is ever in scope.
+2. **In-VM coordinator only** — "be askari" points NetBird at the coordinator running
+   inside the VM itself (localhost endpoint). The VM forms its own one-node mesh; it
+   never enrols in the real NetBird mesh.
+3. **Isolated NAT network** — test VMs sit on a dedicated libvirt NAT network.
+   Outbound NAT provides ACME/image-pull access, but the VM is not reachable from
+   the LAN (`10.20.x`) or the real mesh.
+
+---
+
+## Resource constraints
+
+The default VM profile is ~2 vCPU / 3 GiB RAM / 20 GiB thin-provisioned overlay. The
+driver enforces **one integration VM at a time** (refusing to start if another
+`boma-it-*` VM is already running) and refuses to start below the free-RAM threshold
+(~13 GiB available on ubongo at baseline, per ADR-025).
+
+**Do not run a test-integration cycle alongside a Level-4 browser session**
+(Chromium/Playwright, ADR-017) — both compete for ubongo RAM. The resource guard is the
+enforcement mechanism, not a suggestion.
+
+---
+
+## Adding a new profile
+
+To make the harness "be" a different host:
+
+1. Create `tests/integration/profiles/.json` — specifies which roles to apply
+   and base VM sizing for that host.
+2. Create `tests/integration/overrides/.yml` — the explicit stub overlay:
+   cert tier, in-VM coordinator endpoint (if the host runs the coordinator),
+   `ansible_host` placeholder, and any other variables that must differ from the real
+   inventory (e.g. public DNS → local resolution, geo-DB disable for coordinator).
+3. Add assertions to `tests/integration/verify.yml` (or extend an existing task with a
+   `when: inventory_hostname == ''` guard) for any host-specific outcomes.
+4. Run `make test-integration HOST=` to validate the new profile.
+
+All stubs must be explicit in the overlay — the real inventory is never edited.
+
+---
+
+## Reproducing the 2026-06-17 incident
+
+The acceptance test for the harness (ADR-025) deliberately reproduces the incident:
+
+1. Run with today's `base` (firewall on, no `docker_host` container-forward drop-in):
+   ```bash
+   make test-integration HOST=askari CERTS=internal
+   ```
+   The assert step **must FAIL** after reboot (Docker forwarding dead, published ports
+   unreachable). If it passes, the harness is not faithful.
+
+2. Implement the `docker_host` container-forward rules (FRICTION 2026-06-17 #1 fix) and
+   re-run. The assert step **must PASS** across the reboot.
+
+This round-trip proves: (a) the harness faithfully reproduces the incident, and (b) the
+fix survives a real reboot.
+
+---
+
+## Related
+
+- ADR-025 — decision record for this harness (approach, cert tiers, safety invariants)
+- ADR-008 — testing methodology; this is Level 2/3
+- `docs/security/accepted-risks.md` R6 — `le-prod-wildcard` accepted risk
+- `docs/FRICTION.md` — 2026-06-17 signals that motivated this runbook
diff --git a/docs/runbooks/new-host.md b/docs/runbooks/new-host.md
index 2ea3db1..d0d5962 100644
--- a/docs/runbooks/new-host.md
+++ b/docs/runbooks/new-host.md
@@ -109,6 +109,13 @@
 make check PLAYBOOK=site # Should report no changes
 ```

+> **Pre-flight before lockout-risky changes (firewall / sshd / boot):** before applying
+> any change that touches nftables rules, SSH configuration, or boot ordering, run
+> `make test-integration HOST=` and confirm reboot-recovery on the local VM
+> **while the break-glass (Proxmox console / Hetzner console) is still open**. Do not
+> retire the break-glass until the integration test passes. See
+> `docs/runbooks/integration-testing.md` and ADR-025.
+
 ---

 ## Part E — Control node (`ubongo`, manual exception)
diff --git a/docs/runbooks/new-role.md b/docs/runbooks/new-role.md
index 714e1fe..5788977 100644
--- a/docs/runbooks/new-role.md
+++ b/docs/runbooks/new-role.md
@@ -114,7 +114,20 @@ reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup
+```
+
+See `docs/runbooks/integration-testing.md` and ADR-025.
+
+### 14. Commit

 ```bash
 git checkout -b role/