boma/docs/decisions/025-local-vm-integration-testing.md
sjat bc8592616b fix: address final whole-branch review findings
- ADR-023 §4: ADR-015 no-sudo sub-decision now Superseded-by ADR-025 (bidirectional), not just an in-place amendment.
- STATUS: drop the deferred `reset` verb; honest integration_test (molecule not run in this env; applied to ubongo) + verify (forward/DNAT, not wt0); RED->GREEN validated.
- driver: remove unused `import shutil`.
- README: fix the ADR-025 link filename.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-18 21:52:28 +02:00

10 KiB
Raw Blame History

ADR-025 — Local VM integration testing on ubongo

Status

Accepted (2026-06-18). Implements ADR-008 Level 2/3 (deferred for lack of hosts; now viable on ubongo). RED→GREEN acceptance PASSED on real hardware (2026-06-18): a throwaway KVM VM on ubongo reproduced the 2026-06-17 incident (base's nftables forward default-deny kills Docker forwarding on reboot) — RED — and survived the reboot once the docker_host container-forward drop-in was applied — GREEN. Two shakedown learnings added below.

Context

Molecule (ADR-008 Level 1) tests each role in a single Docker container: one converge, no real kernel netfilter, no real Docker daemon in the loop, and no reboot. That structurally cannot catch an entire class of bug — reboot-survivability, host-firewall × Docker interaction, and boot-ordering — which is exactly the class that caused the 2026-06-17 mesh-hardening incident.

During that incident, base's nftables forward { policy drop; } killed the askari Docker host on reboot: nftables loaded its default-deny before Docker, breaking published-port DNAT and inter-container forwarding. Public services and the mesh went down. It had worked right after make deploy, when Docker's runtime rules still coexisted. ip_nonlocal_bind also failed to beat the sshd boot-race, leaving the mesh listener absent at boot. Recovery required the Hetzner console and a WAN-SSH break-glass. Molecule had passed.

ADR-008's Level 2/3 was deferred "for lack of hosts." ubongo breaks that deferral:

verified: ubongo KVM capability · Bash (2026-06-18 session) · /dev/kvm present + accessible (kvm group), Intel VT-x (vmx) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant not yet installed · 2026-06-18.

Decision

1. Virtualisation approach: libvirt/KVM directly (Approach A)

A golden Debian-13 genericcloud qcow2 is cached locally on ubongo. Each run boots an ephemeral qcow2 overlay backed by it (the golden image is never mutated), seeded via cloud-init NoCloud, driven by a stdlib-only Python driver (scripts/ integration-vm.py) over virsh / virt-install / cloud-localds. No libvirt- python dependency — the driver stays portable and the role stays lean.

2. Fidelity envelope

The bugs are post-boot, not in the provisioning path. A lightweight local hypervisor is sufficient: real OS, real kernel netfilter, real Docker daemon, real published-port DNAT, a real reboot, and the coordinator running inside the VM (so the VM forms its own one-node mesh, reproducing the circular bootstrap). The Proxmox provisioning chrome is not mirrored.

3. Scope: one throwaway VM at a time, instantiated from real inventory

The first profile is "be askari" — a single box running Docker host + NetBird coordinator + mesh peer, mirroring the host whose incident motivates this work. The mechanism is generic: swap the profile to "be" any inventory host. Multi-VM topologies are a deferred extension.

4. Acceptance: self-validating against the real failure

The harness is accepted when it can, on a local VM:

  1. Apply base (firewall on, no docker_host container-forward drop-in) to a Docker host, reboot, and observe the 2026-06-17 breakage (Docker forwarding dead, services down). If step 1 passes, the harness is not faithful.
  2. Apply the docker_host container-forward fix, re-run, and survive the reboot.

5. Tiered cert fidelity via a --certs knob

DNS-01 is what makes real certs possible without public inbound (validation is out-of-band via a Gandi TXT record; the VM needs only outbound to ACME + Gandi, which the isolated NAT network provides):

Tier Description Default?
internal Caddy tls internal — zero deps, instant. For incident repro and runs where certs are not under test. Yes
le-staging Real DNS-01 ACME against Let's Encrypt staging — real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. Built in v1; use when testing the ACME/cert path.
le-prod-wildcard A real trusted *.test.wingu.me wildcard, issued once, persisted on ubongo, reused across runs. On-demand only. Accepted risk recorded as R6 in docs/security/accepted-risks.md.

A deliberate "no-egress" failure scenario (reproducing FRICTION 2026-06-17 #4 — netbird-server FATAL-loops on GeoLite2 download when egress is lost) forces internal, since ACME requires egress.

6. The toolchain is Ansible-managed

A new non-service role (integration_test, control group) installs and enables libvirt + QEMU + virtinst reproducibly. The driver manages the golden image lazily on first run (keeping the role lean; no fiddly download/refresh logic in Ansible). The repo owns ubongo's state.

7. Stubs live in an overlay file, never in the real inventory

Transient inventory entries for the test VM are generated at runtime as a single-host file. Stubs (cert tier, in-VM coordinator endpoint, VM connection details) live in tests/integration/overrides/<host>.yml — an explicit, reviewable overlay. The real inventory is never touched, so make tf-inventory and "don't edit inventory directly" stay intact.

Consequences

  • Reconciles ADR-015: ubongo runs ephemeral KVM test VMs as part of its local-test-runner role — it is still not a production hypervisor. A default VM (~2 vCPU / 3 GiB / 20 GiB thin overlay) against ~13 GiB free is comfortable; the driver enforces one integration VM at a time (resource guard, name-prefix boma-it-*) and refuses to start below a free-RAM threshold.
  • Operationalises the standing rule: "firewall/sshd/boot changes must be tested on a real VM with a real reboot before they touch a live host" (FRICTION 2026-06-17 #6) becomes a concrete, runnable step documented in docs/runbooks/integration-testing.md.
  • Accepted risk R6: le-prod-wildcard runs pass the production Gandi PAT (vault.gandi.pat) to an ephemeral local VM and write transient _acme-challenge TXT records into the real wingu.me zone. Scope: on-demand only; le-staging is the default. Compensating controls: ephemeral VM, isolated NAT network, TXT records auto-removed by Caddy after validation.
  • Three safety invariants make the test tool itself safe:
    1. The transient inventory contains only the test VM — no real host is ever in scope.
    2. "Be askari" points NetBird at the in-VM coordinator — the VM forms its own one-node mesh; it never enrols in the real mesh.
    3. Test VMs sit on an isolated libvirt NAT network — outbound NAT for ACME/image pulls only, not reachable to the LAN (10.20.x) or the real mesh.
  • Diagnostics on failure (catching a bug is the point): failure keeps the VM and dumps nft list ruleset, docker ps, ss -tlnp, journalctl -b, systemd-analyze critical-chain. make test-integration-clean reaps all boma-it-* orphans. Diagnostics land in gitignored ~/integration-runs/<ts>-<host>/.
  • Future pinch: concurrency with the Level-4 Chromium/Playwright stack (ADR-017) competes for ubongo RAM. The resource guard is the v1 answer — one integration VM at a time; don't run alongside a heavy Level-4 session. Revisit at /capacity-review.

Scope

In scope: reboot-survivability, host-firewall × Docker interaction, boot-ordering, cert/ACME paths, mesh bootstrap on one box.

Out of scope (v1): multi-VM mini-cluster (inter-host mesh dataplane); CI gate (this is an interactive, agent-driven pre-deploy check; CI stays lint + Molecule per ADR-008/010); the Proxmox provisioning path (the bugs live in the boot/kernel/Docker layer, not provisioning).

What was ruled out

Option Reason
Proxmox VE nested on ubongo Highest fidelity including the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs do not live in provisioning.
Vagrant + vagrant-libvirt Mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin; boxes drift from the real Debian cloud image; the reboot→assert sequence still needs custom logic.
terraform-provider-libvirt Declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence; adds throwaway state; blurs ADR-006's "TF owns production VM existence on Proxmox" boundary.

Verified facts (ADR-014)

  • verified: ubongo KVM capability · Bash · /dev/kvm present + accessible (kvm group), Intel VT-x (vmx) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB disk free · 2026-06-18.

Shakedown learnings (2026-06-18 live run)

Two findings from the RED→GREEN acceptance run that affect anyone operating the harness:

  1. Boot firmware: UEFI required. The Debian 13 genericcloud image triple-faults under legacy BIOS/SeaBIOS and does not reach the kernel. Boot the VM with UEFI (virt-install --boot uefi; ovmf package). The driver does this by default; note it here so the requirement is findable.

  2. claude sudo is load-bearing. VM management (virsh, virt-install, cloud-localds) and offline diagnostics (nft list ruleset, journalctl -b, systemd-analyze critical-chain) all require root. The harness assumes the AI-worker has NOPASSWD:ALL sudo on ubongo — settled as the ADR-015 amendment (2026-06-18) and registered as R7 in docs/security/accepted-risks.md. A claude account without sudo will block the harness at the first virsh call.

The nine full shakedown findings (including the UEFI boot-loop) are in docs/FRICTION.md.

  • ADR-006 — Terraform owns production VM existence (boundary this ADR respects).
  • ADR-008 — Testing methodology (Levels 14); this ADR is the concrete build of Level 2/3.
  • ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs. Supersedes ADR-015's "no local sudo" sub-decision for the AI-worker — the shakedown necessitated claude NOPASSWD sudo (ADR-023 §4; access model in ADR-021, risk R7).
  • ADR-016 — Mesh VPN; the "be askari" profile includes the coordinator role.
  • ADR-020 — Firewall strategy; firewall × Docker interaction is what this harness tests.
  • ADR-021 — Operational access; sudo model for claude and sjat on ubongo.
  • ADR-024 — Reverse proxy (Caddy); cert tiers exercise the DNS-01 ACME path.