docs(runbook): integration-testing runbook + pre-flight cross-links

- New docs/runbooks/integration-testing.md: when to use (firewall/sshd/boot/Docker
  changes); make test-integration commands; lower-level driver sub-commands; cert tier
  guidance; diagnostics dir; VM inspection (virsh console / SSH); safety invariants;
  resource constraints; adding a new profile; self-validating acceptance test.
- docs/runbooks/new-host.md: pre-flight warning before deploying lockout-risky changes
  (firewall/sshd/boot) while break-glass is open
- docs/runbooks/new-role.md: step 13 pre-flight for lockout-risky roles

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 4732730515
commit f51ae1a13d
3 changed files with 250 additions and 1 deletion
229
docs/runbooks/integration-testing.md
Normal file
@ -0,0 +1,229 @@
# Runbook: Local VM integration testing

## When to use this

Run a local VM integration test before deploying any change that touches:

- **nftables / firewall rules** (the `firewall` concern of `base`)
- **sshd configuration** (listener address, port, key types, `base` hardening)
- **boot ordering or kernel parameters** (systemd units, sysctl)
- **Docker host networking** (`docker_host` DNAT rules, published-port forwarding, `daemon.json`)

These are the change classes that Molecule (ADR-008 Level 1) cannot catch: they surface
only on a real kernel after a real reboot. This harness is the concrete tool for ADR-008
Level 2/3 (see ADR-025) and directly operationalises two standing rules:

- **"Test risky infra before live deploy"** (standing rule, ubongo memory): firewall/sshd/boot changes must be tested on a real VM with a real reboot before touching a live host.
- **FRICTION 2026-06-17 #6, validate reboot-recovery before retiring the break-glass**: the lesson from the mesh-hardening incident is to confirm the host recovers from reboot *while you still have the break-glass open*, not after.

You do not need this runbook for pure-config changes (template rendering, package lists, user management); Molecule covers those.

---

## First-deploy (one-time setup)

The `integration_test` role installs libvirt + QEMU + virtinst on ubongo and adds the
operator accounts (`sjat`, `claude`) to the `libvirt` and `kvm` groups.

```bash
make deploy PLAYBOOK=site LIMIT=ubongo TAGS=integration_test
```

**Log in again after this run**: group membership changes do not take effect in the
current session. The driver (`scripts/integration-vm.py`) requires both `libvirt` and
`kvm` group membership to create and manage VMs.
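
A quick way to confirm that the new memberships are active in the current session is to compare the process's group IDs against the two group names above. This is a minimal sketch using only the Python standard library; the function name `missing_groups` is illustrative and is not part of the driver.

```python
import grp
import os

REQUIRED_GROUPS = ("libvirt", "kvm")  # group names taken from the runbook text


def missing_groups(required=REQUIRED_GROUPS, active_gids=None):
    """Return the required groups that the current session does not hold."""
    if active_gids is None:
        # Supplementary groups plus the effective GID of this process.
        active_gids = set(os.getgroups()) | {os.getegid()}
    missing = []
    for name in required:
        try:
            gid = grp.getgrnam(name).gr_gid
        except KeyError:
            # The group does not exist on this machine at all.
            missing.append(name)
            continue
        if gid not in active_gids:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_groups()
    if absent:
        print("not active in this session: " + ", ".join(absent))
    else:
        print("libvirt and kvm membership is active")
```

If a group is reported as missing right after the deploy, logging out and back in is the first thing to try.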

The golden Debian-13 genericcloud qcow2 image is downloaded lazily on the first run
(one-time cost, ~500 MB); subsequent runs reuse the cached image.

---

## Running a cycle

### Makefile interface (recommended)

```bash
# Full cycle (provision → apply → reboot → assert → teardown on pass)
make test-integration HOST=askari

# With a specific cert tier
make test-integration HOST=askari CERTS=le-staging

# Keep the VM alive after the run (for manual inspection)
make test-integration HOST=askari KEEP=1

# Destroy all orphan integration VMs (name-prefix boma-it-*)
make test-integration-clean
```

`HOST` is a hostname from the production inventory; the profile
`tests/integration/profiles/<host>.json` must exist (see Adding a new profile below).
`CERTS` defaults to `internal`.

### Lower-level driver

The driver (`scripts/integration-vm.py`) exposes individual lifecycle steps for manual
or scripted use:

| Sub-command | What it does |
|---|---|
| `up` | Ensure golden image → create ephemeral overlay → cloud-init seed → boot |
| `apply` | Run the site playbook against the transient inventory (real apply) |
| `reboot` | `virsh reboot` + wait for a verified reboot (boot-id change); the step Molecule cannot do |
| `assert` | Run `tests/integration/verify.yml` (outcome assertions) |
| `cycle` | `up` → `apply` → `reboot` → `assert` → `down` (default: destroy on pass) |
| `down` | Destroy the VM + overlay |
| `prune` | Destroy all `boma-it-*` VMs + overlays (orphan cleanup) |
| `console` | Print the VM's captured serial-console log |
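
The `reboot` row depends on detecting a boot-id change. The sketch below illustrates the idea under stated assumptions: the boot ID is the value Linux exposes at `/proc/sys/kernel/random/boot_id`, and the caller supplies a `read_boot_id` callable that fetches it from the guest. How the driver actually reads it is not shown in this runbook, and the function names and timeout values here are hypothetical.

```python
import time

BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"  # standard Linux location


def read_local_boot_id(path=BOOT_ID_PATH):
    """Read the boot ID of the machine this code runs on."""
    with open(path, encoding="ascii") as handle:
        return handle.read().strip()


def wait_for_new_boot_id(before, read_boot_id, timeout_s=300.0, interval_s=5.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until read_boot_id() returns a value different from `before`.

    Returns the new boot ID, or raises TimeoutError if it never changes.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            current = read_boot_id()
        except OSError:
            # Treated as "guest not reachable yet" while it is rebooting.
            current = None
        if current and current != before:
            return current
        if clock() >= deadline:
            raise TimeoutError("boot ID did not change; reboot not verified")
        sleep(interval_s)
```

Comparing boot IDs rather than waiting for SSH to come back avoids a false pass when the reboot request was silently ignored and the guest never went down.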

```bash
# Example: step through manually
python3 scripts/integration-vm.py up --host askari
python3 scripts/integration-vm.py apply --host askari
python3 scripts/integration-vm.py reboot --host askari
python3 scripts/integration-vm.py assert --host askari
python3 scripts/integration-vm.py down --host askari
```
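
For scripted use, the same sequence can be wrapped in a small function. This is a sketch only: it uses the sub-commands and the `--host` flag shown above, while `run_cycle`, its `keep` parameter (mirroring `KEEP=1`) and the injectable `run` callable are illustrative names, not the driver's own `cycle` implementation.

```python
import subprocess

DRIVER = "scripts/integration-vm.py"  # path from the runbook
STEPS = ("up", "apply", "reboot", "assert")


def driver_cmd(step, host):
    """Build the command line for one lifecycle step."""
    return ["python3", DRIVER, step, "--host", host]


def run_cycle(host, keep=False, run=subprocess.run):
    """Run the steps in order and stop at the first failure.

    Returns the name of the failed step, or None when every step passed.
    On failure the VM is left up for inspection; on success it is
    destroyed unless keep is True.
    """
    for step in STEPS:
        result = run(driver_cmd(step, host))
        if result.returncode != 0:
            return step
    if not keep:
        run(driver_cmd("down", host))
    return None


if __name__ == "__main__":
    failed = run_cycle("askari")
    print("cycle passed" if failed is None else f"cycle failed at: {failed}")
```

Stopping at the first failing step keeps the VM in the state that produced the failure, which is what the inspection commands later in this runbook rely on.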

---

## Cert tiers

| Tier | Flag | Use when |
|---|---|---|
| `internal` | `CERTS=internal` (default) | Incident repro, firewall/sshd/boot changes where certs are not under test. Zero deps, instant. |
| `le-staging` | `CERTS=le-staging` | Testing the Caddy DNS-01 ACME path, cert renewal logic, or the `caddy-gandi` plugin. Real cert files, untrusted root, effectively no rate limits. Requires `vault.gandi.pat`. |
| `le-prod-wildcard` | `CERTS=le-prod-wildcard` | Verifying TLS behaviour with a real trusted cert. On-demand only; accepted risk R6 (`docs/security/accepted-risks.md`): the production Gandi PAT reaches an ephemeral VM and transient TXT records are written into the real `wingu.me` zone. |

> A deliberate "no-egress" scenario (reproducing FRICTION 2026-06-17 #4, the
> `netbird-server` GeoLite2 FATAL-loop when NAT masquerade is wiped) **must** use
> `CERTS=internal`: the egress loss is the fault being simulated, and ACME requires egress.
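
The tier rules above can be expressed as a small validation step. The following is a sketch under assumptions: the tier names and the `internal` default come from the table, while the `no_egress` flag and the function name `resolve_cert_tier` are hypothetical and only illustrate the constraint in the note.

```python
CERT_TIERS = ("internal", "le-staging", "le-prod-wildcard")  # names from the table
DEFAULT_TIER = "internal"
ACME_TIERS = frozenset({"le-staging", "le-prod-wildcard"})  # tiers that need egress


def resolve_cert_tier(requested=None, no_egress=False):
    """Return the tier to use, rejecting unknown or contradictory choices."""
    tier = requested or DEFAULT_TIER
    if tier not in CERT_TIERS:
        raise ValueError(f"unknown cert tier: {tier!r}")
    if no_egress and tier in ACME_TIERS:
        # ACME needs outbound access, which a no-egress scenario removes.
        raise ValueError(f"{tier} requires egress; use {DEFAULT_TIER}")
    return tier
```

For example, `resolve_cert_tier()` yields `internal`, and `resolve_cert_tier("le-staging", no_egress=True)` raises `ValueError`.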

---

## Diagnostics and inspecting a failed VM

### Where diagnostics land

Diagnostics from every run are captured in:

```
~/integration-runs/<timestamp>-<host>/
```

This directory is gitignored. On a failed assert step, the driver dumps:

- `nft list ruleset`: the live nftables state at failure
- `docker ps -a`: container states
- `ss -tlnp`: listening sockets
- `journalctl -b`: full boot log
- `systemd-analyze critical-chain`: boot timing
- Serial console capture (on boot/SSH failure; the automated equivalent of the Hetzner
  console, addressing FRICTION 2026-06-17 #5)

The agent reads these directly from `~/integration-runs/`, so no manual download is needed.
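
The collection step can be pictured as a mapping from output file to command. This is a sketch, not the driver's code: the commands and the `<timestamp>-<host>` directory shape come from the list above, but the file names, the timestamp format and the `run_in_vm` callable (which is assumed to execute a command inside the guest and return its output as text) are all hypothetical.

```python
from datetime import datetime
from pathlib import Path

# Output file name (hypothetical) mapped to the command named in the runbook.
DIAGNOSTICS = {
    "nft-ruleset.txt": ["nft", "list", "ruleset"],
    "docker-ps.txt": ["docker", "ps", "-a"],
    "listening-sockets.txt": ["ss", "-tlnp"],
    "journal-boot.txt": ["journalctl", "-b"],
    "critical-chain.txt": ["systemd-analyze", "critical-chain"],
}


def run_dir(host, root=None, now=None):
    """Build the per-run directory path: <root>/<timestamp>-<host>."""
    root = Path(root) if root is not None else Path.home() / "integration-runs"
    stamp = (now or datetime.now()).strftime("%Y%m%dT%H%M%S")  # assumed format
    return root / f"{stamp}-{host}"


def dump_diagnostics(host, run_in_vm, root=None, now=None):
    """Write the output of each diagnostic command to its own file."""
    target = run_dir(host, root, now)
    target.mkdir(parents=True, exist_ok=True)
    for filename, command in DIAGNOSTICS.items():
        (target / filename).write_text(run_in_vm(command), encoding="utf-8")
    return target
```

One file per command keeps each artefact small enough to read or diff on its own.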

### Inspecting a kept or failed VM

When a run fails or when `KEEP=1` is passed, the VM is left running. Connect to it:

```bash
# Serial console (no SSH needed; useful when SSH is the fault)
python3 scripts/integration-vm.py console --host askari
# or directly:
virsh console boma-it-askari
# Exit with Ctrl-]

# SSH (as the ansible user, IP from virsh)
virsh domifaddr boma-it-askari --source lease
ssh ansible@<IP>

# List all integration VMs
virsh list --all | grep boma-it-
```

### Cleanup

```bash
# Destroy a specific VM
python3 scripts/integration-vm.py down --host askari

# Reap all orphans
make test-integration-clean
# or:
python3 scripts/integration-vm.py prune
```

---

## Safety invariants

These invariants make the test tool itself safe, so the harness cannot reach or modify production:

1. **Single-host transient inventory**: the playbook apply runs against a generated
   single-host inventory (`ansible_host=<VM lease IP>`). No real host is ever in scope.
2. **In-VM coordinator only**: "be askari" points NetBird at the coordinator running
   inside the VM itself (localhost endpoint). The VM forms its own one-node mesh; it
   never enrols in the real NetBird mesh.
3. **Isolated NAT network**: test VMs sit on a dedicated libvirt NAT network.
   Outbound NAT provides ACME/image-pull access, but the VM is not reachable from
   the LAN (`10.20.x`) or the real mesh.
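
The first invariant can be illustrated with a generator that emits exactly one host line and refuses any address outside the test network. This is a sketch under assumptions: the inventory layout (a single ungrouped INI host line), the `ansible_user=ansible` value (taken from the SSH example in this runbook) and the example subnet `192.168.150.0/24` are illustrative; the real network range and the driver's actual inventory format are not given here.

```python
import ipaddress


def render_transient_inventory(host, lease_ip, test_network):
    """Return a one-host INI inventory, or raise if the IP is out of range."""
    address = ipaddress.ip_address(lease_ip)
    network = ipaddress.ip_network(test_network)
    if address not in network:
        # Guard against pointing the apply at anything but the test VM.
        raise ValueError(f"{lease_ip} is outside the test network {test_network}")
    return f"{host} ansible_host={address} ansible_user=ansible\n"


if __name__ == "__main__":
    # 192.168.150.0/24 is a made-up example subnet.
    print(render_transient_inventory("askari", "192.168.150.10", "192.168.150.0/24"), end="")
```

Checking the lease IP against the dedicated network before writing the inventory means a stale or mistyped address fails fast instead of reaching a real host.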

---

## Resource constraints

The default VM profile is ~2 vCPU / 3 GiB RAM / 20 GiB thin-provisioned overlay. The
driver enforces **one integration VM at a time** (refusing to start if another
`boma-it-*` VM is already running) and refuses to start below the free-RAM threshold
(~13 GiB available on ubongo at baseline, per ADR-025).

**Do not run a test-integration cycle alongside a Level-4 browser session**
(Chromium/Playwright, ADR-017): both compete for ubongo RAM. The resource guard is the
enforcement mechanism, not a suggestion.
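
A guard of this kind needs two inputs: the names of running domains and the memory currently available. The sketch below assumes the names come from `virsh list --name` and the memory figure from the `MemAvailable` line of `/proc/meminfo`; the function names and the `min_free_gib` parameter are hypothetical, and the actual threshold value is defined by the driver and ADR-025, not here.

```python
import subprocess

VM_PREFIX = "boma-it-"  # name prefix from the runbook


def mem_available_gib(meminfo_text):
    """Extract MemAvailable from /proc/meminfo content and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 * 1024)
    raise ValueError("MemAvailable not found")


def running_domains():
    """List the names of running libvirt domains."""
    out = subprocess.run(["virsh", "list", "--name"], check=True,
                         capture_output=True, text=True).stdout
    return [name.strip() for name in out.splitlines() if name.strip()]


def refusal_reason(domains, available_gib, min_free_gib):
    """Return why a new VM must not start, or None if it may."""
    active = [name for name in domains if name.startswith(VM_PREFIX)]
    if active:
        return f"{active[0]} is already running"
    if available_gib < min_free_gib:
        return f"only {available_gib:.1f} GiB available, need {min_free_gib}"
    return None
```

Returning a reason string instead of a bare boolean lets the caller print exactly which limit was hit.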

---

## Adding a new profile

To make the harness "be" a different host:

1. Create `tests/integration/profiles/<hostname>.json`, which specifies which roles to apply
   and base VM sizing for that host.
2. Create `tests/integration/overrides/<hostname>.yml`, the explicit stub overlay:
   cert tier, in-VM coordinator endpoint (if the host runs the coordinator),
   `ansible_host` placeholder, and any other variables that must differ from the real
   inventory (e.g. public DNS → local resolution, geo-DB disable for coordinator).
3. Add assertions to `tests/integration/verify.yml` (or extend an existing task with a
   `when: inventory_hostname == '<hostname>'` guard) for any host-specific outcomes.
4. Run `make test-integration HOST=<hostname>` to validate the new profile.

All stubs must be explicit in the overlay; the real inventory is never edited.
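
Before running step 4, it can help to confirm that the two per-host files from steps 1 and 2 are in place. This is a minimal sketch: the paths follow the steps above, while the helper names are illustrative and nothing is assumed about what the files contain.

```python
from pathlib import Path


def profile_files(host, repo_root="."):
    """Return the profile and override paths expected for a host."""
    base = Path(repo_root) / "tests" / "integration"
    return [
        base / "profiles" / f"{host}.json",
        base / "overrides" / f"{host}.yml",
    ]


def missing_profile_files(host, repo_root="."):
    """Return the expected files that do not exist yet."""
    return [path for path in profile_files(host, repo_root) if not path.is_file()]


if __name__ == "__main__":
    for path in missing_profile_files("askari"):
        print(f"missing: {path}")
```

An empty result only means the files exist; whether their contents are right is what the test cycle itself checks.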

---

## Reproducing the 2026-06-17 incident

The acceptance test for the harness (ADR-025) deliberately reproduces the incident:

1. Run with today's `base` (firewall on, no `docker_host` container-forward drop-in):
   ```bash
   make test-integration HOST=askari CERTS=internal
   ```
   The assert step **must FAIL** after reboot (Docker forwarding dead, published ports
   unreachable). If it passes, the harness is not faithful.

2. Implement the `docker_host` container-forward rules (FRICTION 2026-06-17 #1 fix) and
   re-run. The assert step **must PASS** across the reboot.

This round-trip proves: (a) the harness faithfully reproduces the incident, and (b) the
fix survives a real reboot.

---

## Related

- ADR-025: decision record for this harness (approach, cert tiers, safety invariants)
- ADR-008: testing methodology; this is Level 2/3
- `docs/security/accepted-risks.md` R6: `le-prod-wildcard` accepted risk
- `docs/FRICTION.md`: 2026-06-17 signals that motivated this runbook
docs/runbooks/new-host.md
@ -109,6 +109,13 @@ make check PLAYBOOK=site
 # Should report no changes
 ```
 
+> **Pre-flight before lockout-risky changes (firewall / sshd / boot):** before applying
+> any change that touches nftables rules, SSH configuration, or boot ordering, run
+> `make test-integration HOST=<name>` and confirm reboot-recovery on the local VM
+> **while the break-glass (Proxmox console / Hetzner console) is still open**. Do not
+> retire the break-glass until the integration test passes. See
+> `docs/runbooks/integration-testing.md` and ADR-025.
+
 ---
 
## Part E — Control node (`ubongo`, manual exception)
docs/runbooks/new-role.md
@ -114,7 +114,20 @@ reason and gets no `BACKUP.md`. Once the backup node exists, `/check-backup <rol
proves the declared state is captured — part of the service-clearance gate
 (`docs/security/service-checklist.md`).
 
-### 13. Commit
+### 13. Pre-flight for lockout-risky roles
+
+If the new role touches nftables rules, SSH configuration, or boot ordering, run a
+local VM integration test and confirm reboot-recovery **before** deploying to a live
+host and while the host's break-glass (Proxmox console / Hetzner console) is still
+open:
+
+```bash
+make test-integration HOST=<target-host>
+```
+
+See `docs/runbooks/integration-testing.md` and ADR-025.
+
+### 14. Commit
 
 ```bash
 git checkout -b role/<rolename>