# ADR-025 — Local VM integration testing on ubongo ## Status Accepted (2026-06-18). Implements ADR-008 Level 2/3 (deferred for lack of hosts; now viable on ubongo). **RED→GREEN acceptance PASSED on real hardware (2026-06-18):** a throwaway KVM VM on ubongo reproduced the 2026-06-17 incident (base's nftables forward default-deny kills Docker forwarding on reboot) — RED — and survived the reboot once the `docker_host` container-forward drop-in was applied — GREEN. Two shakedown learnings added below. ## Context Molecule (ADR-008 Level 1) tests each role in a single Docker container: one `converge`, no real kernel netfilter, no real Docker daemon in the loop, and **no reboot**. That structurally cannot catch an entire class of bug — reboot-survivability, host-firewall × Docker interaction, and boot-ordering — which is exactly the class that caused the **2026-06-17 mesh-hardening incident**. During that incident, `base`'s nftables `forward { policy drop; }` killed the askari Docker host **on reboot**: nftables loaded its default-deny before Docker, breaking published-port DNAT and inter-container forwarding. Public services and the mesh went down. It had worked right after `make deploy`, when Docker's runtime rules still coexisted. `ip_nonlocal_bind` also failed to beat the sshd boot-race, leaving the mesh listener absent at boot. Recovery required the Hetzner console and a WAN-SSH break-glass. Molecule had passed. ADR-008's Level 2/3 was deferred "for lack of hosts." ubongo breaks that deferral: > verified: ubongo KVM capability · Bash (2026-06-18 session) · `/dev/kvm` present + > accessible (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM > free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant **not yet installed** · > 2026-06-18. ## Decision ### 1. Virtualisation approach: libvirt/KVM directly (Approach A) A golden Debian-13 genericcloud qcow2 is cached locally on ubongo. Each run boots an ephemeral qcow2 **overlay** backed by it (the golden image is never mutated), seeded via cloud-init NoCloud, driven by a **stdlib-only** Python driver (`scripts/ integration-vm.py`) over `virsh` / `virt-install` / `cloud-localds`. No `libvirt- python` dependency — the driver stays portable and the role stays lean. ### 2. Fidelity envelope The bugs are **post-boot**, not in the provisioning path. A lightweight local hypervisor is sufficient: real OS, real kernel netfilter, real Docker daemon, real published-port DNAT, a **real reboot**, and the coordinator running inside the VM (so the VM forms its own one-node mesh, reproducing the circular bootstrap). The Proxmox provisioning chrome is not mirrored. ### 3. Scope: one throwaway VM at a time, instantiated from real inventory The first profile is **"be askari"** — a single box running Docker host + NetBird coordinator + mesh peer, mirroring the host whose incident motivates this work. The mechanism is generic: swap the profile to "be" any inventory host. Multi-VM topologies are a deferred extension. ### 4. Acceptance: self-validating against the real failure The harness is accepted when it can, on a local VM: 1. Apply `base` (firewall on, no `docker_host` container-forward drop-in) to a Docker host, reboot, and observe the **2026-06-17 breakage** (Docker forwarding dead, services down). If step 1 passes, the harness is not faithful. 2. Apply the `docker_host` container-forward fix, re-run, and **survive the reboot**. ### 5. Tiered cert fidelity via a `--certs` knob DNS-01 is what makes real certs possible without public inbound (validation is out-of-band via a Gandi TXT record; the VM needs only outbound to ACME + Gandi, which the isolated NAT network provides): | Tier | Description | Default? | |---|---|---| | `internal` | Caddy `tls internal` — zero deps, instant. For incident repro and runs where certs are not under test. | Yes | | `le-staging` | Real DNS-01 ACME against Let's Encrypt **staging** — real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. | Built in v1; use when testing the ACME/cert path. | | `le-prod-wildcard` | A real trusted `*.test.wingu.me` wildcard, **issued once, persisted on ubongo, reused** across runs. | On-demand only. Accepted risk recorded as R6 in `docs/security/accepted-risks.md`. | A deliberate "no-egress" failure scenario (reproducing FRICTION 2026-06-17 #4 — `netbird-server` FATAL-loops on GeoLite2 download when egress is lost) forces `internal`, since ACME requires egress. ### 6. The toolchain is Ansible-managed A new non-service role (`integration_test`, `control` group) installs and enables libvirt + QEMU + virtinst reproducibly. The driver manages the golden image lazily on first run (keeping the role lean; no fiddly download/refresh logic in Ansible). The repo owns ubongo's state. ### 7. Stubs live in an overlay file, never in the real inventory Transient inventory entries for the test VM are generated at runtime as a single-host file. Stubs (cert tier, in-VM coordinator endpoint, VM connection details) live in `tests/integration/overrides/.yml` — an explicit, reviewable overlay. The real inventory is never touched, so `make tf-inventory` and "don't edit inventory directly" stay intact. ## Consequences - **Reconciles ADR-015:** ubongo runs ephemeral KVM test VMs as part of its local-test-runner role — it is still not a production hypervisor. A default VM (~2 vCPU / 3 GiB / 20 GiB thin overlay) against ~13 GiB free is comfortable; the driver enforces **one integration VM at a time** (resource guard, name-prefix `boma-it-*`) and refuses to start below a free-RAM threshold. - **Operationalises the standing rule:** "firewall/sshd/boot changes must be tested on a real VM with a real reboot before they touch a live host" (FRICTION 2026-06-17 #6) becomes a concrete, runnable step documented in `docs/runbooks/integration-testing.md`. - **Accepted risk R6:** `le-prod-wildcard` runs pass the production Gandi PAT (`vault.gandi.pat`) to an ephemeral local VM and write transient `_acme-challenge` TXT records into the real `wingu.me` zone. Scope: on-demand only; `le-staging` is the default. Compensating controls: ephemeral VM, isolated NAT network, TXT records auto-removed by Caddy after validation. - **Three safety invariants** make the test tool itself safe: 1. The transient inventory contains only the test VM — no real host is ever in scope. 2. "Be askari" points NetBird at the in-VM coordinator — the VM forms its own one-node mesh; it never enrols in the real mesh. 3. Test VMs sit on an isolated libvirt NAT network — outbound NAT for ACME/image pulls only, not reachable to the LAN (`10.20.x`) or the real mesh. - **Diagnostics on failure** (catching a bug is the point): failure keeps the VM and dumps `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`, `systemd-analyze critical-chain`. `make test-integration-clean` reaps all `boma-it-*` orphans. Diagnostics land in gitignored `~/integration-runs/-/`. - **Future pinch:** concurrency with the Level-4 Chromium/Playwright stack (ADR-017) competes for ubongo RAM. The resource guard is the v1 answer — one integration VM at a time; don't run alongside a heavy Level-4 session. Revisit at `/capacity-review`. ## Scope **In scope:** reboot-survivability, host-firewall × Docker interaction, boot-ordering, cert/ACME paths, mesh bootstrap on one box. **Out of scope (v1):** multi-VM mini-cluster (inter-host mesh dataplane); CI gate (this is an interactive, agent-driven pre-deploy check; CI stays lint + Molecule per ADR-008/010); the Proxmox provisioning path (the bugs live in the boot/kernel/Docker layer, not provisioning). ## What was ruled out | Option | Reason | |---|---| | **Proxmox VE nested on ubongo** | Highest fidelity including the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs do not live in provisioning. | | **Vagrant + vagrant-libvirt** | Mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin; boxes drift from the real Debian cloud image; the reboot→assert sequence still needs custom logic. | | **terraform-provider-libvirt** | Declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence; adds throwaway state; blurs ADR-006's "TF owns *production* VM existence on Proxmox" boundary. | ## Verified facts (ADR-014) - verified: ubongo KVM capability · Bash · `/dev/kvm` present + accessible (kvm group), Intel VT-x (`vmx`) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB disk free · 2026-06-18. ## Shakedown learnings (2026-06-18 live run) Two findings from the RED→GREEN acceptance run that affect anyone operating the harness: 1. **Boot firmware: UEFI required.** The Debian 13 genericcloud image triple-faults under legacy BIOS/SeaBIOS and does not reach the kernel. Boot the VM with UEFI (`virt-install --boot uefi`; `ovmf` package). The driver does this by default; note it here so the requirement is findable. 2. **`claude` sudo is load-bearing.** VM management (`virsh`, `virt-install`, `cloud-localds`) and offline diagnostics (`nft list ruleset`, `journalctl -b`, `systemd-analyze critical-chain`) all require root. The harness assumes the AI-worker has `NOPASSWD:ALL` sudo on `ubongo` — settled as the ADR-015 amendment (2026-06-18) and registered as R7 in `docs/security/accepted-risks.md`. A `claude` account without sudo will block the harness at the first `virsh` call. The nine full shakedown findings (including the UEFI boot-loop) are in `docs/FRICTION.md`. ## Related - ADR-006 — Terraform owns production VM existence (boundary this ADR respects). - ADR-008 — Testing methodology (Levels 1–4); this ADR is the concrete build of Level 2/3. - ADR-015 — Control host (ubongo); this ADR reconciles "not a hypervisor" with ephemeral test VMs. **Supersedes** ADR-015's "no local sudo" sub-decision for the AI-worker — the shakedown necessitated `claude` NOPASSWD sudo (ADR-023 §4; access model in ADR-021, risk R7). - ADR-016 — Mesh VPN; the "be askari" profile includes the coordinator role. - ADR-020 — Firewall strategy; firewall × Docker interaction is what this harness tests. - ADR-021 — Operational access; sudo model for `claude` and `sjat` on `ubongo`. - ADR-024 — Reverse proxy (Caddy); cert tiers exercise the DNS-01 ACME path.