From edcc347a95b27a8027e9706be9aecc3286646366 Mon Sep 17 00:00:00 2001 From: sjat Date: Thu, 18 Jun 2026 12:49:52 +0200 Subject: [PATCH] docs(adr): ADR-025 local VM integration testing Accepted decision to implement ADR-008 Level 2/3 on ubongo via libvirt/KVM directly: throwaway VM overlays, stdlib-only driver, tiered cert fidelity, three safety invariants. Addresses the 2026-06-17 mesh-hardening incident's reboot-survivability gap. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../025-local-vm-integration-testing.md | 157 ++++++++++++++++++ 1 file changed, 157 insertions(+) create mode 100644 docs/decisions/025-local-vm-integration-testing.md diff --git a/docs/decisions/025-local-vm-integration-testing.md b/docs/decisions/025-local-vm-integration-testing.md new file mode 100644 index 0000000..dc3fe21 --- /dev/null +++ b/docs/decisions/025-local-vm-integration-testing.md @@ -0,0 +1,157 @@ +# ADR-025 — Local VM integration testing on ubongo + +## Status + +Accepted (2026-06-18). Implements ADR-008 Level 2/3 (deferred for lack of hosts; now +viable on ubongo). The harness code is built and lint+pytest-clean; RED/GREEN +acceptance is pending the first live run on ubongo. + +## Context + +Molecule (ADR-008 Level 1) tests each role in a single Docker container: one +`converge`, no real kernel netfilter, no real Docker daemon in the loop, and **no +reboot**. That structurally cannot catch an entire class of bug — reboot-survivability, +host-firewall × Docker interaction, and boot-ordering — which is exactly the class +that caused the **2026-06-17 mesh-hardening incident**. + +During that incident, `base`'s nftables `forward { policy drop; }` killed the askari +Docker host **on reboot**: nftables loaded its default-deny before Docker, breaking +published-port DNAT and inter-container forwarding. Public services and the mesh went +down. It had worked right after `make deploy`, when Docker's runtime rules still +coexisted. `ip_nonlocal_bind` also failed to beat the sshd boot-race, leaving the mesh +listener absent at boot. Recovery required the Hetzner console and a WAN-SSH +break-glass. Molecule had passed. + +ADR-008's Level 2/3 was deferred "for lack of hosts." ubongo breaks that deferral: + +> verified: ubongo KVM capability · Bash (2026-06-18 session) · `/dev/kvm` present + +> accessible (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM +> free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant **not yet installed** · +> 2026-06-18. + +## Decision + +### 1. Virtualisation approach: libvirt/KVM directly (Approach A) + +A golden Debian-13 genericcloud qcow2 is cached locally on ubongo. Each run boots an +ephemeral qcow2 **overlay** backed by it (the golden image is never mutated), seeded +via cloud-init NoCloud, driven by a **stdlib-only** Python driver (`scripts/ +integration-vm.py`) over `virsh` / `virt-install` / `cloud-localds`. No `libvirt- +python` dependency — the driver stays portable and the role stays lean. + +### 2. Fidelity envelope + +The bugs are **post-boot**, not in the provisioning path. A lightweight local hypervisor +is sufficient: real OS, real kernel netfilter, real Docker daemon, real published-port +DNAT, a **real reboot**, and the coordinator running inside the VM (so the VM forms its +own one-node mesh, reproducing the circular bootstrap). The Proxmox provisioning chrome +is not mirrored. + +### 3. Scope: one throwaway VM at a time, instantiated from real inventory + +The first profile is **"be askari"** — a single box running Docker host + NetBird +coordinator + mesh peer, mirroring the host whose incident motivates this work. The +mechanism is generic: swap the profile to "be" any inventory host. Multi-VM topologies +are a deferred extension. + +### 4. Acceptance: self-validating against the real failure + +The harness is accepted when it can, on a local VM: + +1. Apply `base` (firewall on, no `docker_host` container-forward drop-in) to a Docker + host, reboot, and observe the **2026-06-17 breakage** (Docker forwarding dead, + services down). If step 1 passes, the harness is not faithful. +2. Apply the `docker_host` container-forward fix, re-run, and **survive the reboot**. + +### 5. Tiered cert fidelity via a `--certs` knob + +DNS-01 is what makes real certs possible without public inbound (validation is +out-of-band via a Gandi TXT record; the VM needs only outbound to ACME + Gandi, which +the isolated NAT network provides): + +| Tier | Description | Default? | +|---|---|---| +| `internal` | Caddy `tls internal` — zero deps, instant. For incident repro and runs where certs are not under test. | Yes | +| `le-staging` | Real DNS-01 ACME against Let's Encrypt **staging** — real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. | Built in v1; use when testing the ACME/cert path. | +| `le-prod-wildcard` | A real trusted `*.test.wingu.me` wildcard, **issued once, persisted on ubongo, reused** across runs. | On-demand only. Accepted risk recorded as R6 in `docs/security/accepted-risks.md`. | + +A deliberate "no-egress" failure scenario (reproducing FRICTION 2026-06-17 #4 — +`netbird-server` FATAL-loops on GeoLite2 download when egress is lost) forces +`internal`, since ACME requires egress. + +### 6. The toolchain is Ansible-managed + +A new non-service role (`integration_test`, `control` group) installs and enables +libvirt + QEMU + virtinst reproducibly. The driver manages the golden image lazily on +first run (keeping the role lean; no fiddly download/refresh logic in Ansible). The +repo owns ubongo's state. + +### 7. Stubs live in an overlay file, never in the real inventory + +Transient inventory entries for the test VM are generated at runtime as a single-host +file. Stubs (cert tier, in-VM coordinator endpoint, VM connection details) live in +`tests/integration/overrides/.yml` — an explicit, reviewable overlay. The real +inventory is never touched, so `make tf-inventory` and "don't edit inventory directly" +stay intact. + +## Consequences + +- **Reconciles ADR-015:** ubongo runs ephemeral KVM test VMs as part of its + local-test-runner role — it is still not a production hypervisor. A default VM + (~2 vCPU / 3 GiB / 20 GiB thin overlay) against ~13 GiB free is comfortable; the + driver enforces **one integration VM at a time** (resource guard, name-prefix + `boma-it-*`) and refuses to start below a free-RAM threshold. +- **Operationalises the standing rule:** "firewall/sshd/boot changes must be tested on + a real VM with a real reboot before they touch a live host" (FRICTION 2026-06-17 #6) + becomes a concrete, runnable step documented in `docs/runbooks/integration-testing.md`. +- **Accepted risk R6:** `le-prod-wildcard` runs pass the production Gandi PAT + (`vault.gandi.pat`) to an ephemeral local VM and write transient `_acme-challenge` + TXT records into the real `wingu.me` zone. Scope: on-demand only; `le-staging` is the + default. Compensating controls: ephemeral VM, isolated NAT network, TXT records + auto-removed by Caddy after validation. +- **Three safety invariants** make the test tool itself safe: + 1. The transient inventory contains only the test VM — no real host is ever in scope. + 2. "Be askari" points NetBird at the in-VM coordinator — the VM forms its own one-node + mesh; it never enrols in the real mesh. + 3. Test VMs sit on an isolated libvirt NAT network — outbound NAT for ACME/image pulls + only, not reachable to the LAN (`10.20.x`) or the real mesh. +- **Diagnostics on failure** (catching a bug is the point): failure keeps the VM and + dumps `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`, + `systemd-analyze critical-chain`. `make test-integration-clean` reaps all `boma-it-*` + orphans. Diagnostics land in gitignored `~/integration-runs/-/`. +- **Future pinch:** concurrency with the Level-4 Chromium/Playwright stack (ADR-017) + competes for ubongo RAM. The resource guard is the v1 answer — one integration VM at a + time; don't run alongside a heavy Level-4 session. Revisit at `/capacity-review`. + +## Scope + +**In scope:** reboot-survivability, host-firewall × Docker interaction, boot-ordering, +cert/ACME paths, mesh bootstrap on one box. + +**Out of scope (v1):** multi-VM mini-cluster (inter-host mesh dataplane); CI gate +(this is an interactive, agent-driven pre-deploy check; CI stays lint + Molecule per +ADR-008/010); the Proxmox provisioning path (the bugs live in the boot/kernel/Docker +layer, not provisioning). + +## What was ruled out + +| Option | Reason | +|---|---| +| **Proxmox VE nested on ubongo** | Highest fidelity including the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs do not live in provisioning. | +| **Vagrant + vagrant-libvirt** | Mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin; boxes drift from the real Debian cloud image; the reboot→assert sequence still needs custom logic. | +| **terraform-provider-libvirt** | Declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence; adds throwaway state; blurs ADR-006's "TF owns *production* VM existence on Proxmox" boundary. | + +## Verified facts (ADR-014) + +- verified: ubongo KVM capability · Bash · `/dev/kvm` present + accessible (kvm group), + Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB + disk free · 2026-06-18. + +## Related + +- ADR-006 — Terraform owns production VM existence (boundary this ADR respects). +- ADR-008 — Testing methodology (Levels 1–4); this ADR is the concrete build of Level 2/3. +- ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs. +- ADR-016 — Mesh VPN; the "be askari" profile includes the coordinator role. +- ADR-020 — Firewall strategy; firewall × Docker interaction is what this harness tests. +- ADR-024 — Reverse proxy (Caddy); cert tiers exercise the DNS-01 ACME path.