Local VM integration testing on ubongo (design)
Status: Designed, not built. Resolves docs/TODO.md item 2.4 (Local VM integration
testing on ubongo, pre-deploy).
Date: 2026-06-18.
Implements: the concrete build of ADR-008 Level 2/3 (staging/integration), deferred
for lack of hosts but hostable on ubongo. To be recorded as ADR-025.
Context
Molecule (ADR-008 Level 1) tests each role in a single Docker container: one converge,
no real kernel netfilter, no real Docker daemon in the loop, and no reboot. That
structurally cannot catch an entire class of bug — reboot-survivability, host-firewall ×
Docker interaction, and boot-ordering — which is exactly the class that caused the
2026-06-17 mesh-hardening incident:
- `base`'s nftables `forward { policy drop; }` killed the askari Docker host on reboot (nftables loaded its default-deny before Docker, breaking published-port DNAT and inter-container forwarding → public services + the mesh went down). It had worked right after `make deploy`, when Docker's runtime rules still coexisted. (FRICTION 2026-06-17 #1.)
- `ip_nonlocal_bind` did not beat the sshd boot-race; sshd bound to the `wt0` mesh IP had no listener at boot. (FRICTION #2.)
- The coordinator host could not bootstrap the mesh it itself hosts. (FRICTION #3.)
- NetBird `netbird-server` FATAL-loops on the GeoLite2 download when egress is lost, and egress was lost when `nft flush` wiped Docker's NAT masquerade. (FRICTION #4.)
Recovery needed the Hetzner console + a WAN-SSH break-glass. The lesson, already crystallised as a standing rule: firewall/sshd/boot changes must be tested on a real VM with a real reboot before they touch a live host, and a non-mesh break-glass must be kept.
This spec defines a way for the agent to spin up throwaway KVM VMs locally on ubongo that mirror a target host (real Docker, a real reboot, the real role apply) and validate risky infra changes before a live deploy. ubongo can host this today:
verified: ubongo KVM capability · Bash (this session) · `/dev/kvm` present + accessible (kvm group), Intel VT-x (vmx) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant not yet installed · 2026-06-18.
Goals
- Reproduce the 2026-06-17 bug class locally: real OS boot, real Docker, real netfilter, the real role apply, a real reboot, then outcome assertions.
- Let the agent drive the full loop autonomously: provision → apply → reboot → assert → teardown, with diagnostics captured on failure.
- Mirror a real host from inventory (first profile: "be askari"), so the apply is faithful, not synthetic.
- Be the concrete tool that operationalises the standing "test risky infra before live deploy" rule.
Non-goals (v1)
- Not a production hypervisor on ubongo (reconciles ADR-015 — see Governance).
- Not nested Proxmox; the provisioning chrome (template clone / Terraform) is not mirrored — every incident bug lives in the boot/kernel/Docker layer, not provisioning.
- Not a multi-VM mini-cluster; one VM at a time. (All six 2026-06-17 signals occurred on a single host that was Docker host + coordinator + mesh peer.) Multi-VM is a later extension.
- Not a CI gate; this is an interactive, agent-driven pre-deploy check on ubongo (CI stays lint + Molecule per ADR-008/010).
Decisions (from the 2026-06-18 brainstorm)
1. Virtualisation approach: libvirt/KVM directly (Approach A). A golden Debian-13 genericcloud qcow2 is cached locally; each run boots an ephemeral qcow2 overlay backed by it, seeded via cloud-init NoCloud, driven by a stdlib-only Python script over `virsh` (no `libvirt-python` dependency). Chosen over Vagrant+vagrant-libvirt (Ruby/plugin footprint, box drift from the real cloud image) and terraform-provider-libvirt (poor at the imperative apply→reboot→re-apply sequence, throwaway state, blurs ADR-006's prod-VM boundary). Lightest footprint on a 15 GiB control node; full control of the reboot step; the same Debian cloud image real hosts boot.
2. Fidelity envelope: real OS/Docker/netfilter/reboot, not the Proxmox provisioning path. A lightweight local hypervisor is enough because the bugs are post-boot.
3. Scope: one throwaway VM at a time, instantiated from a real host's inventory. First profile: "be askari" (Docker host + NetBird coordinator + mesh peer on one box). The mechanism is generic: later it can "be" any host by swapping which inventory host it mirrors.
4. Acceptance is self-validating against the real failure. Done = the harness, on a local VM, applies `base` (firewall on) to a Docker host, reboots, and observes the 2026-06-17 breakage (Docker forwarding dead / services down); then, with the `docker_host` container-forward drop-in in place, the same run survives the reboot. If the first run passes (no breakage observed), the harness is not faithful.
5. Tiered cert fidelity via a `--certs` knob (DNS-01 is what makes real certs possible with no public inbound: validation is out-of-band via a Gandi TXT record; the VM needs only outbound to ACME + Gandi, which the NAT net provides):
   - `internal` (default): Caddy `tls internal`, zero deps, instant; for the incident repro and runs where certs aren't under test.
   - `le-staging`: real DNS-01 ACME against Let's Encrypt staging: real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. Built in v1.
   - `le-prod-wildcard`: a real trusted `*.test.wingu.me` wildcard, issued once, persisted on ubongo, reused across runs. Wired in v1 but on-demand only; its accepted risk is recorded when used (prod Gandi credential reaching an ephemeral VM; transient TXT in the real `wingu.me` zone).

   A deliberate "no-egress" failure scenario (to reproduce FRICTION #4) forces `internal`, since ACME needs egress.
6. The toolchain is Ansible-managed, not hand-installed: a new non-service role (`integration_test`, `control` group) installs/enables libvirt+QEMU reproducibly. The repo owns ubongo's state. The driver manages images lazily on first run (keeps the role lean; avoids fiddly download/refresh logic in Ansible).
7. Stubs live in an overlay file, never in the real inventory, so `make tf-inventory` and "don't edit inventory directly" stay intact, and every stub is explicit and reviewable.
8. A new ADR-025 records this decision (approach + alternatives + cert tiers); ADR-008 gains a pointer and redirects its "what Molecule does NOT test" gaps here.
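The golden-image and overlay handling in the virtualisation decision can be sketched as two small pieces of pure logic. This is a minimal sketch, not the driver itself: the function names are hypothetical, and it assumes the expected digest is a SHA-512 hex string obtained from the image publisher's checksum file. The `qemu-img create -f qcow2 -b … -F qcow2` form is the standard way to create a copy-on-write overlay; the 20 GiB default mirrors the resource guard's figure.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large image never sits fully in memory."""
    digest = hashlib.sha512()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def golden_is_valid(golden: Path, expected_sha512: str) -> bool:
    """True only if the cached golden image exists and matches the expected digest."""
    return golden.is_file() and sha512_of(golden) == expected_sha512.strip().lower()


def overlay_create_argv(golden: Path, overlay: Path, size_gib: int = 20) -> list[str]:
    """Build the qemu-img command for a throwaway overlay.

    -b names the backing file and -F its format; writes land in the overlay,
    so the golden image is only ever read.
    """
    return [
        "qemu-img", "create",
        "-f", "qcow2",
        "-b", str(golden),
        "-F", "qcow2",
        str(overlay),
        f"{size_gib}G",
    ]
```

Returning an argv list rather than running the command keeps this unit testable without `qemu-img` installed; the driver would pass the list to `subprocess.run(..., check=True)`.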
Architecture — five isolated components
| # | Component | Purpose | Location |
|---|---|---|---|
| 1 | `integration_test` role (non-service, `control` group) | Install/enable libvirt+QEMU+virtinst, add sjat/claude to libvirt group, create the image-cache dir, drop the driver. Idempotent, Molecule-tested. | `roles/integration_test/` |
| 2 | `integration-vm.py` driver | Stdlib-only lifecycle over virsh: up / apply / reboot / assert / cycle / reset / down / prune / console. Lazily ensures the golden image (download + checksum). | `scripts/integration-vm.py` |
| 3 | Profiles + var overlays | Make a VM "become" a host: pull that host's real group_vars/host_vars + layer a small explicit overlay (cert tier, in-VM coordinator endpoint, VM connection). | `tests/integration/overrides/<host>.yml` |
| 4 | Verify playbook | Outcome-based post-reboot assertions (Docker up, published-port DNAT alive, nft sane, service responds, wt0 up), reusing Molecule's verify.yml philosophy. | `tests/integration/verify.yml` |
| 5 | Makefile target | `make test-integration HOST=<name> [CERTS=...] [KEEP=1]` → cycle; `make test-integration-clean` → prune. Documented in CLAUDE.md's command table. | `Makefile` |
Lifecycle / data flow
make test-integration HOST=askari drives:
1. ensure golden image: Debian-13 genericcloud qcow2, cached + checksum-verified
2. ephemeral overlay qcow2: backed by golden (throwaway; never mutate golden)
3. cloud-init NoCloud seed: hostname + ansible user + ubongo's SSH key + NIC
4. virt-install --import: boot on an isolated libvirt NAT net (DHCP IP + outbound NAT)
5. wait for SSH: IP via `virsh domifaddr --source lease` (guest-agent optional)
6. transient inventory: askari's real vars + `ansible_host=<lease IP>` + stub overlay
7. ansible-playbook site: THE REAL APPLY (base + docker_host + reverse_proxy + coordinator)
8. [snapshot post-apply]: optional reset point for fast re-runs
9. virsh reboot: together with step 10, the step Molecule structurally cannot do
10. wait for SSH
11. ansible-playbook verify: outcome assertions; THIS is where the incident surfaces
12. report + teardown: pass/fail; on fail keep VM + dump diagnostics; else destroy overlay
Steps 1–7 build a real Docker daemon with real published-port DNAT to break; step 9 is a real kernel reboot, so nftables loads default-deny before Docker exactly as on askari.
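Step 5's lease lookup is one of the pieces of pure logic the driver can unit-test. The sketch below parses the table that `virsh domifaddr <domain> --source lease` prints. It assumes the usual four-column layout (Name, MAC address, Protocol, Address with a CIDR suffix); that layout should be confirmed against the libvirt version on ubongo. The function name is hypothetical.

```python
import ipaddress
from typing import Optional


def first_ipv4_from_domifaddr(output: str) -> Optional[str]:
    """Return the first IPv4 address in `virsh domifaddr` output, or None.

    Data rows have exactly four whitespace-separated fields; the header row
    ("MAC address" splits in two) and the dashed rule do not, so they are skipped.
    """
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[2] != "ipv4":
            continue
        try:
            # The address column carries a prefix length, e.g. 192.168.122.45/24.
            return str(ipaddress.ip_interface(fields[3]).ip)
        except ValueError:
            continue
    return None
```

The driver would poll this until it returns an address (the lease appears only after the guest's DHCP request), then start probing SSH on that address.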
Fidelity boundary & cert tiers
Faithful where the bug lives: real kernel, real netfilter, real Docker with published-port DNAT, the real role apply, a real reboot, and the coordinator running inside the VM so the VM is its own mesh peer — reproducing the circular mesh-bootstrap (FRICTION #3) on one box.
Stubbed where it needs the public internet (explicit, in the overlay): LE certs via the
--certs knob (Decision 5); public DNS (askari.wingu.me) → local resolution; NetBird
geo-DB → pre-seeded or requirement disabled (which is also the FRICTION #4 fix, so the
harness can test both the FATAL-loop and its remedy).
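The `--certs`→overlay mapping could look roughly like the sketch below. The variable names (`it_tls_mode`, `it_acme_ca`, `it_reuse_persisted_cert`) are hypothetical placeholders, not the names `reverse_proxy` actually consumes; the two directory URLs are Let's Encrypt's public ACME v2 endpoints. The `no_egress` flag models the rule that the no-egress scenario forces `internal`.

```python
# Let's Encrypt ACME v2 directory endpoints.
LE_STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"
LE_PROD = "https://acme-v02.api.letsencrypt.org/directory"

# Hypothetical overlay variables per cert tier (names are illustrative only).
CERT_TIERS = {
    "internal": {"it_tls_mode": "internal"},
    "le-staging": {"it_tls_mode": "acme-dns01", "it_acme_ca": LE_STAGING},
    "le-prod-wildcard": {
        "it_tls_mode": "acme-dns01",
        "it_acme_ca": LE_PROD,
        "it_reuse_persisted_cert": True,  # issued once, reused across runs
    },
}


def cert_overlay(tier: str = "internal", no_egress: bool = False) -> dict:
    """Map a --certs tier to overlay vars; a no-egress run always gets internal."""
    if no_egress:
        tier = "internal"  # ACME needs egress, so real issuance cannot work here
    if tier not in CERT_TIERS:
        raise ValueError(f"unknown cert tier: {tier!r}")
    return dict(CERT_TIERS[tier])  # copy, so callers cannot mutate the table
```

Keeping the mapping as a plain table makes each tier's stub explicit and reviewable, in line with the rule that stubs live in the overlay rather than in the real inventory.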
Acceptance test (self-validating)
- Run the cycle on today's `base` (firewall on, no `docker_host` container-forward drop-in) → step 11 must FAIL after reboot (Docker forwarding dead, services down).
- Implement the `docker_host` container-forward rules (the pending fix STATUS.md names) → re-run → step 11 must PASS across the reboot.
Scope boundary: the harness is this plan's deliverable. The docker_host
container-forward fix is a separate work item (FRICTION 2026-06-17 #1). v1's acceptance
deliberately spans both, because a credible harness must demonstrate both a true positive
(red on the broken state) and a true negative (green on the fixed state); otherwise we have
only ever watched the assert go red. The plan decides sequencing: build the small
docker_host drop-in as the green half of acceptance, or consume it if built separately
first. Minimum credible v1 is the red half (faithful reproduction); full acceptance is red→green.
This one round-trip proves the harness reproduces the incident, the fix works, and the loop can be trusted for the next risky change before it touches a live host.
Robustness, isolation & teardown
Failure leaves evidence (catching a bug is the point):
| Step fails | Behaviour |
|---|---|
| Golden image (1) | Fail fast, clear message; image cached (one-time cost) |
| Boot / first SSH (4–5) | Capture serial console to a log file, fail with its tail — the automated equivalent of the Hetzner console (ties to TODO 10.8) |
| Apply (7) | Keep VM, surface Ansible output, dump diagnostics |
| No SSH after reboot (9–10) | The classic incident signature; FAIL, keep VM, capture console — the harness succeeding |
| Assert (11) | FAIL, keep VM, dump post-mortem: nft list ruleset, docker ps, ss -tlnp, journalctl -b, systemd-analyze critical-chain; exit non-zero |
Diagnostics land in gitignored ~/integration-runs/<ts>-<host>/ (same pattern as ADR-017's
screenshot dir; the agent reads them directly).
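The post-mortem dump for a failed assert can be sketched as a table of commands plus a writer. This is an illustration under assumptions: the file names and the `run_in_vm` callable (something that executes an argv inside the VM, for example over SSH, and returns its output) are hypothetical, and `--no-pager` is added to `journalctl` so it never waits on a pager.

```python
from pathlib import Path
from typing import Callable

# Commands from the failure table; output file names are illustrative.
POST_MORTEM = {
    "nft-ruleset.txt": ["nft", "list", "ruleset"],
    "docker-ps.txt": ["docker", "ps"],
    "listeners.txt": ["ss", "-tlnp"],
    "journal.txt": ["journalctl", "-b", "--no-pager"],
    "critical-chain.txt": ["systemd-analyze", "critical-chain"],
}


def dump_post_mortem(run_dir: Path, run_in_vm: Callable[[list[str]], str]) -> list[Path]:
    """Write one file per diagnostic command; a failing command never aborts the dump."""
    run_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, argv in POST_MORTEM.items():
        try:
            text = run_in_vm(argv)
        except Exception as exc:  # keep collecting: partial evidence beats none
            text = f"<command failed: {exc}>\n"
        path = run_dir / filename
        path.write_text(text)
        written.append(path)
    return written
```

Injecting `run_in_vm` keeps the collection logic testable without a VM, and recording a failed command in its own file preserves the fact that it failed, which is itself evidence (for example, `docker ps` failing because the daemon is down).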
Three safety invariants (these make the testing tool itself safe):
- The transient inventory contains only the test VM: no real host is ever in scope; the apply is `--limit`ed to the VM.
- "Be askari" points NetBird at the in-VM coordinator (localhost): the VM forms its own one-node mesh; it never enrolls in the real mesh.
- Test VMs sit on an isolated libvirt NAT net: outbound NAT for ACME/image pulls, but no path to the LAN (`10.20.x`) or the real mesh.
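The first invariant lends itself to a mechanical check. The sketch below builds a transient inventory as a plain dict in Ansible's YAML-inventory shape (JSON is valid YAML, so a stdlib-only driver can write it with `json.dump`) and collects every host name in it so the driver can refuse to run if anything other than the test VM is present. The function names, the `boma-it-askari` VM name, and the example group are hypothetical.

```python
import json


def transient_inventory(vm_name, lease_ip, host_vars, overlay, groups=()):
    """Real host vars, then the stub overlay, then the VM's own address on top."""
    merged = {**host_vars, **overlay, "ansible_host": lease_ip}
    inventory = {"all": {"hosts": {vm_name: merged}}}
    if groups:
        # Group membership only; the vars live on the single host entry above.
        inventory["all"]["children"] = {g: {"hosts": {vm_name: {}}} for g in groups}
    return inventory


def hosts_in(node):
    """Collect every host name under an inventory node, recursing into children."""
    found = set(node.get("hosts", {}))
    for child in node.get("children", {}).values():
        found |= hosts_in(child)
    return found


if __name__ == "__main__":
    inv = transient_inventory(
        "boma-it-askari",
        "192.168.122.45",
        host_vars={"ansible_host": "203.0.113.9", "some_var": "real"},
        overlay={"some_var": "stubbed"},
        groups=["docker_hosts"],
    )
    # Invariant 1: the test VM is the only host in scope.
    assert hosts_in(inv["all"]) == {"boma-it-askari"}
    print(json.dumps(inv, indent=2))
```

The merge order matters: because `ansible_host` is applied last, a real host's address carried in its host_vars can never survive into the transient inventory.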
Resource guard (ubongo's 15 GiB ceiling, ADR-015/012): default VM ≈ 2 vCPU / 3 GiB / 20
GiB thin overlay; the driver refuses to start below a free-RAM threshold and enforces one
integration VM at a time (name-prefix boma-it-*). Teardown: success destroys domain +
overlay; failure keeps them and prints how to inspect; make test-integration-clean reaps
all boma-it-* orphans. An optional post-apply snapshot lets reset re-run
reboot+assert without re-applying (fast iteration on a fix).
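The resource-guard math can be sketched as follows. The 3 GiB VM size and the `boma-it-` prefix come from the text above; the 2 GiB headroom is an assumed placeholder for the free-RAM threshold, which this spec does not fix. The guard reads `MemAvailable` from `/proc/meminfo` content and takes domain names as a list, such as the output of `virsh list --all --name` split into lines.

```python
def mem_available_mib(meminfo_text: str) -> int:
    """Read MemAvailable (reported in kB) from /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemAvailable not found in meminfo")


def refusal_reasons(meminfo_text, domain_names, vm_mib=3072,
                    headroom_mib=2048, prefix="boma-it-"):
    """Return the reasons the driver should refuse to start (empty list = go)."""
    reasons = []
    existing = [name for name in domain_names if name.startswith(prefix)]
    if existing:
        # One integration VM at a time, including kept-after-failure ones.
        reasons.append(f"integration VM already present: {', '.join(existing)}")
    available = mem_available_mib(meminfo_text)
    needed = vm_mib + headroom_mib  # headroom_mib is an assumed threshold
    if available < needed:
        reasons.append(f"only {available} MiB available, need {needed} MiB")
    return reasons
```

Returning reasons instead of a bare boolean lets the driver print exactly why it refused, which matters when the blocker is a VM kept from an earlier failed run.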
Testing the tester
- pytest on the driver's pure logic: transient-inventory generation, var/overlay merge, `--certs`→overlay mapping, DHCP-lease parsing, resource-guard math (mock `virsh`). Joins boma's existing pytest suite.
- Molecule (Docker) on the `integration_test` role: asserts libvirt/qemu/virtinst installed, `libvirtd` enabled, users in `libvirt` group, driver present. (Cannot run KVM-in-Docker; this is the documented Molecule limitation.)
- End-to-end self-test = the acceptance test above, run manually on first build and recorded in the runbook.
Governance & documentation touch-points
- ADR-025 "Local VM integration testing" — decision, approach A, rejected alternatives (Proxmox-nested / Vagrant / TF-libvirt), cert tiers.
- ADR-008 — pointer to ADR-025; redirect its "what Molecule does NOT test" gaps (nftables loading, mesh dataplane) to this level.
- ADR-015 — one-line reconciliation: "not a hypervisor" → runs ephemeral KVM test VMs as part of its local-test-runner role (still not a production hypervisor); note the test-VM RAM load.
- `docs/security/accepted-risks.md`: the `le-prod-wildcard` risk (prod Gandi credential → ephemeral VM; transient TXT in real `wingu.me`).
- CLAUDE.md command table + `docs/runbooks/integration-testing.md` (run a cycle, cert knobs, where diagnostics land, inspecting a kept failed VM, pruning) + STATUS.md entry. The runbook's pre-flight line operationalises FRICTION #6 (validate reboot-recovery before retiring the break-glass).
Capacity
One VM (~3 GiB) against ~13 GiB free is comfortable. The only future pinch is concurrency
with the Level-4 Chromium/Playwright stack (ADR-017) — handled by the resource guard +
"one at a time." Add a note to docs/hardware/reference.md; revisit at /capacity-review.
Alternatives considered
- Proxmox VE nested on ubongo — highest fidelity incl. the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs don't live in provisioning. Rejected.
- Vagrant + vagrant-libvirt — mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin, boxes drift from the real Debian cloud image, and the reboot→assert sequence still needs custom logic. Rejected.
- terraform-provider-libvirt — declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence, adds throwaway state, and blurs ADR-006's "TF owns production VM existence on Proxmox" boundary. Rejected.
Open questions / deferred
- Multi-VM mini-staging (inter-host mesh/dataplane) — design the driver + NAT net so a topology is an additive extension; out of scope for v1.
- Interplay with the Level-4 browser stack — both want ubongo RAM; the resource guard is the v1 answer, revisit when Level 4 is built.
- Snapshot strategy depth — v1 ships clone-and-destroy + an optional post-apply snapshot; richer snapshot trees deferred.
Knowledge to verify at plan stage (ADR-014)
These are from memory / unverified and must be confirmed against version-matched docs before the plan asserts them:
- Exact `virt-install --import` flags and the cloud-init NoCloud seed format on the Debian-13 libvirt stack.
- Whether the Debian-13 genericcloud image ships `qemu-guest-agent` (IP can come from the DHCP lease regardless; guest-agent is an optimisation, not a requirement).
- Let's Encrypt rate limits (prod vs staging), to confirm "issue the wildcard once, reuse" stays within limits.
- The `caddy-dns/gandi` DNS-01 configuration and pinned version already used by `reverse_proxy`, and whether the Gandi LiveDNS API key can be scoped to `test.wingu.me`.
- libvirt default vs a dedicated isolated NAT network on Debian-13 (`virsh net-*`).