sjat/boma

sjat 02e1eb7449 docs(spec): design local VM integration testing on ubongo (2.4)

Throwaway KVM VMs on ubongo (libvirt, Approach A) that mirror a real host (real Docker, real reboot, real role apply) to catch the reboot/firewall/boot-order class Molecule cannot - the 2026-06-17 mesh-hardening incident. First profile: be askari; tiered certs (internal + le-staging built, le-prod-wildcard on-demand). Concrete build of ADR-008 Level 2/3; to be recorded as ADR-025.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-18 11:35:51 +02:00


Local VM integration testing on ubongo (design)

Status: Designed, not built. Resolves docs/TODO.md item 2.4 (Local VM integration testing on ubongo, pre-deploy).
Date: 2026-06-18.
Implements: the concrete build of ADR-008 Level 2/3 (staging/integration), deferred for lack of hosts but hostable on ubongo. To be recorded as ADR-025.

Context

Molecule (ADR-008 Level 1) tests each role in a single Docker container: one converge, no real kernel netfilter, no real Docker daemon in the loop, and no reboot. That structurally cannot catch an entire class of bug — reboot-survivability, host-firewall × Docker interaction, and boot-ordering — which is exactly the class that caused the 2026-06-17 mesh-hardening incident:

- base's nftables forward { policy drop; } killed the askari Docker host on reboot (nftables loaded its default-deny before Docker, breaking published-port DNAT and inter-container forwarding → public services + the mesh went down). It had worked right after make deploy, when Docker's runtime rules still coexisted. (FRICTION 2026-06-17 #1.)
- ip_nonlocal_bind did not beat the sshd boot-race; sshd bound to the wt0 mesh IP had no listener at boot. (FRICTION #2.)
- The coordinator host could not bootstrap the mesh it itself hosts. (FRICTION #3.)
- NetBird netbird-server FATAL-loops on the GeoLite2 download when egress is lost, and egress was lost when nft flush wiped Docker's NAT masquerade. (FRICTION #4.)

Recovery needed the Hetzner console + a WAN-SSH break-glass. The lesson, already crystallised as a standing rule: firewall/sshd/boot changes must be tested on a real VM with a real reboot before they touch a live host, and a non-mesh break-glass must be kept.

This spec defines a way for the agent to spin up throwaway KVM VMs locally on ubongo that mirror a target host (real Docker, a real reboot, the real role apply) and validate risky infra changes before a live deploy. ubongo can host this today:

verified: ubongo KVM capability · Bash (this session) · /dev/kvm present + accessible (kvm group), Intel VT-x (vmx) enabled, 8 vCPU (i3-10100T), ~13 GiB RAM free of 16, ~198 GiB disk free; libvirt/QEMU/Vagrant not yet installed · 2026-06-18.

Goals

Reproduce the 2026-06-17 bug class locally: real OS boot, real Docker, real netfilter, the real role apply, a real reboot, then outcome assertions.
Let the agent drive the full loop autonomously: provision → apply → reboot → assert → teardown, with diagnostics captured on failure.
Mirror a real host from inventory (first profile: "be askari"), so the apply is faithful, not synthetic.
Be the concrete tool that operationalises the standing "test risky infra before live deploy" rule.

Non-goals (v1)

Not a production hypervisor on ubongo (reconciles ADR-015 — see Governance).
Not nested Proxmox; the provisioning chrome (template clone / Terraform) is not mirrored — every incident bug lives in the boot/kernel/Docker layer, not provisioning.
Not a multi-VM mini-cluster; one VM at a time. (All six 2026-06-17 signals occurred on a single host that was Docker host + coordinator + mesh peer.) Multi-VM is a later extension.
Not a CI gate; this is an interactive, agent-driven pre-deploy check on ubongo (CI stays lint + Molecule per ADR-008/010).

Decisions (from the 2026-06-18 brainstorm)

Virtualisation approach: libvirt/KVM directly (Approach A). A golden Debian-13 genericcloud qcow2 cached locally; each run boots an ephemeral qcow2 overlay backed by it, seeded via cloud-init NoCloud, driven by a stdlib-only Python script over virsh (no libvirt-python dependency). Chosen over Vagrant+vagrant-libvirt (Ruby/plugin footprint, box drift from the real cloud image) and terraform-provider-libvirt (poor at the imperative apply→reboot→re-apply sequence, throwaway state, blurs ADR-006's prod-VM boundary). Lightest footprint on a 15 GiB control node; full control of the reboot step; the same Debian cloud image real hosts boot.
Fidelity envelope: real OS/Docker/netfilter/reboot, not the Proxmox provisioning path. A lightweight local hypervisor is enough because the bugs are post-boot.
Scope: one throwaway VM at a time, instantiated from a real host's inventory. First profile: "be askari" (Docker host + NetBird coordinator + mesh peer on one box). The mechanism is generic — later "be" any host by swapping which inventory host it mirrors.
Acceptance is self-validating against the real failure. Done = the harness, on a local VM, applies base (firewall on) to a Docker host, reboots, and observes the 2026-06-17 breakage (Docker forwarding dead / services down); then, with the docker_host container-forward drop-in in place, the same run survives the reboot. If the first run (broken state) passes instead of failing, the harness is not faithful.
Tiered cert fidelity via a --certs knob (DNS-01 is what makes real certs possible with no public inbound — validation is out-of-band via a Gandi TXT record; the VM needs only outbound to ACME + Gandi, which the NAT net provides):
- internal (default) — Caddy tls internal, zero deps, instant; for the incident repro and runs where certs aren't under test.
- le-staging — real DNS-01 ACME against Let's Encrypt staging: real caddy-gandi path, real cert files/renewal, untrusted root, effectively no rate limits. Built in v1.
- le-prod-wildcard — a real trusted *.test.wingu.me wildcard, issued once, persisted on ubongo, reused across runs. Wired in v1 but on-demand only; its accepted risk is recorded when used (prod Gandi credential reaching an ephemeral VM; transient TXT in the real wingu.me zone). A deliberate "no-egress" failure scenario (to reproduce FRICTION #4) forces internal, since ACME needs egress.
The toolchain is Ansible-managed, not hand-installed: a new non-service role (integration_test, control group) installs/enables libvirt+QEMU reproducibly. The repo owns ubongo's state. The driver manages images lazily on first run (keeps the role lean; avoids fiddly download/refresh logic in Ansible).
Stubs live in an overlay file, never in the real inventory — so make tf-inventory and "don't edit inventory directly" stay intact, and every stub is explicit and reviewable.
A new ADR-025 records this decision (approach + alternatives + cert tiers); ADR-008 gains a pointer and redirects its "what Molecule does NOT test" gaps here.
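The --certs knob from Decision 5 could map onto overlay variables roughly as follows. This is a minimal sketch, not the driver's actual code: the tier names and the "no-egress forces internal" rule come from the spec, while the function names and the `it_*` variable names are hypothetical. The two ACME directory URLs are the public Let's Encrypt endpoints.

```python
from typing import Dict

# Tier names come from the spec; everything else here is illustrative.
CERT_TIERS = ("internal", "le-staging", "le-prod-wildcard")

# Public Let's Encrypt ACME directory URLs.
ACME_DIRECTORIES = {
    "le-staging": "https://acme-staging-v02.api.letsencrypt.org/directory",
    "le-prod-wildcard": "https://acme-v02.api.letsencrypt.org/directory",
}


def resolve_cert_tier(requested: str = "internal", no_egress: bool = False) -> str:
    """Return the tier a run will actually use."""
    if requested not in CERT_TIERS:
        raise ValueError(f"unknown cert tier: {requested!r}")
    # ACME needs egress, so a no-egress scenario always falls back to internal.
    return "internal" if no_egress else requested


def cert_overlay(tier: str) -> Dict[str, object]:
    """Map a resolved tier to overlay vars (hypothetical variable names)."""
    if tier == "internal":
        return {"it_cert_mode": "internal"}
    return {
        "it_cert_mode": "acme-dns01",
        "it_acme_directory": ACME_DIRECTORIES[tier],
        # Only the prod wildcard is persisted on ubongo and reused across runs.
        "it_cert_reuse": tier == "le-prod-wildcard",
    }
```

Keeping the tier resolution separate from the overlay mapping means the no-egress rule can be unit-tested without touching any Caddy configuration.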

Architecture — five isolated components

| # | Component | Purpose | Location |
|---|-----------|---------|----------|
| 1 | `integration_test` role (non-service, `control` group) | Install/enable libvirt+QEMU+virtinst, add `sjat`/`claude` to `libvirt` group, create the image-cache dir, drop the driver. Idempotent, Molecule-tested. | `roles/integration_test/` |
| 2 | `integration-vm.py` driver | Stdlib-only lifecycle over `virsh`: `up / apply / reboot / assert / cycle / reset / down / prune / console`. Lazily ensures the golden image (download + checksum). | `scripts/integration-vm.py` |
| 3 | Profiles + var overlays | Make a VM "become" a host: pull that host's real group_vars/host_vars + layer a small explicit overlay (cert tier, in-VM coordinator endpoint, VM connection). | `tests/integration/overrides/<host>.yml` |
| 4 | Verify playbook | Outcome-based post-reboot assertions (Docker up, published-port DNAT alive, `nft` sane, service responds, `wt0` up), reusing Molecule's `verify.yml` philosophy. | `tests/integration/verify.yml` |
| 5 | Makefile target | `make test-integration HOST=<name> [CERTS=...] [KEEP=1]` → `cycle`; `make test-integration-clean` → `prune`. Documented in CLAUDE.md's command table. | `Makefile` |

Lifecycle / data flow

make test-integration HOST=askari drives:

 1. ensure golden image    Debian-13 genericcloud qcow2, cached + checksum-verified
 2. ephemeral overlay      qcow2 backed by golden (throwaway; never mutate golden)
 3. cloud-init NoCloud      seed hostname + ansible user + ubongo's SSH key + NIC
 4. virt-install --import   boot on an isolated libvirt NAT net (DHCP IP + outbound NAT)
 5. wait for SSH            IP via `virsh domifaddr --source lease` (guest-agent optional)
 6. transient inventory     askari's real vars + ansible_host=<lease IP> + stub overlay
 7. ansible-playbook site   THE REAL APPLY (base + docker_host + reverse_proxy + coordinator)
 8. [snapshot post-apply]   optional reset point for fast re-runs
 9. virsh reboot ──────────┐  ← the step Molecule structurally cannot do
10. wait for SSH           ┘
11. ansible-playbook verify outcome assertions; THIS is where the incident surfaces
12. report + teardown       pass/fail; on fail keep VM + dump diagnostics; else destroy overlay

Steps 1–7 build a real Docker daemon with real published-port DNAT to break; step 9 is a real kernel reboot, so nftables loads default-deny before Docker exactly as on askari.
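The control flow of the cycle (run each step in order, keep the VM and dump diagnostics on the first failure, otherwise tear down) can be sketched as below. This is an illustration under assumptions: the steps are injected as plain callables so the logic runs without libvirt, the function and key names are hypothetical, and `keep` is assumed to mean "keep the VM even on success" (the spec names `KEEP=1` but does not define it).

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_cycle(
    steps: List[Step],
    teardown: Callable[[], None],
    dump_diagnostics: Callable[[str], None],
    keep: bool = False,
) -> Dict[str, object]:
    """Run steps in order; on the first failure keep the VM and dump diagnostics."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # any step failure ends the run
            dump_diagnostics(name)
            return {"ok": False, "failed_step": name, "error": str(exc), "vm_kept": True}
    if not keep:
        teardown()  # success: destroy domain + overlay
    return {"ok": True, "failed_step": None, "error": None, "vm_kept": keep}
```

A caller would translate `ok` into the process exit status, so a red run exits non-zero while leaving the VM available for inspection.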

Fidelity boundary & cert tiers

Faithful where the bug lives: real kernel, real netfilter, real Docker with published-port DNAT, the real role apply, a real reboot, and the coordinator running inside the VM so the VM is its own mesh peer — reproducing the circular mesh-bootstrap (FRICTION #3) on one box.

Stubbed where it needs the public internet (explicit, in the overlay): LE certs via the --certs knob (Decision 5); public DNS (askari.wingu.me) → local resolution; NetBird geo-DB → pre-seeded or requirement disabled (which is also the FRICTION #4 fix, so the harness can test both the FATAL-loop and its remedy).
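One way to keep every stub explicit is to merge the overlay on top of the host's real vars and report exactly which keys the overlay touched. The sketch below assumes a shallow merge and uses hypothetical names throughout; the real merge semantics are still to be decided.

```python
from typing import Any, Dict, List, Tuple

_MISSING = object()  # sentinel: key absent from the real vars


def apply_overlay(
    real_vars: Dict[str, Any], overlay: Dict[str, Any]
) -> Tuple[Dict[str, Any], List[str]]:
    """Return (merged vars, sorted keys the overlay added or changed)."""
    merged = dict(real_vars)  # copy, so the real inventory data is never mutated
    merged.update(overlay)    # overlay wins; shallow merge only
    stubbed = sorted(
        key for key, value in overlay.items()
        if real_vars.get(key, _MISSING) != value
    )
    return merged, stubbed
```

Printing the `stubbed` list at the start of each run would make the fidelity boundary visible in the run log rather than only in the overlay file.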

Acceptance test (self-validating)

1. Run the cycle on today's base (firewall on, no docker_host container-forward drop-in) → step 11 must FAIL after reboot (Docker forwarding dead, services down).
2. Implement the docker_host container-forward rules (the pending fix STATUS.md names) → re-run → step 11 must PASS across the reboot.

Scope boundary: the harness is this plan's deliverable. The docker_host container-forward fix is a separate work item (FRICTION 2026-06-17 #1). v1's acceptance deliberately spans both, because a credible harness must demonstrate both a true positive (red on the broken state) and a true negative (green on the fixed state); otherwise we have only ever watched the assert go red. The plan decides sequencing: build the small docker_host drop-in as the green half of acceptance, or consume it if built separately first. Minimum credible v1 is the red half (faithful reproduction); full acceptance is red→green.

This one round-trip proves the harness reproduces the incident, the fix works, and the loop can be trusted for the next risky change before it touches a live host.
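The red→green logic can be written down as a small decision function. This is only a sketch; the function name and verdict labels are hypothetical, not terms from the spec.

```python
from typing import Optional


def acceptance_verdict(red_run_passed: bool, green_run_passed: Optional[bool] = None) -> str:
    """Classify the outcome of the two acceptance runs.

    red_run_passed:   did verify pass on the broken state (no drop-in)?
    green_run_passed: did verify pass with the drop-in? None = not run yet.
    """
    if red_run_passed:
        return "unfaithful"    # the harness failed to reproduce the incident
    if green_run_passed is None:
        return "red-only"      # minimum credible v1: faithful reproduction
    return "accepted" if green_run_passed else "green-failed"
```

A "green-failed" verdict does not say whether the fix or the assertions are at fault; that still needs the kept VM and its diagnostics.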

Robustness, isolation & teardown

Failure leaves evidence (catching a bug is the point):

| Step fails | Behaviour |
|------------|-----------|
| Golden image (1) | Fail fast, clear message; image cached (one-time cost) |
| Boot / first SSH (4–5) | Capture serial console to a log file, fail with its tail; this is the automated equivalent of the Hetzner console (ties to TODO 10.8) |
| Apply (7) | Keep VM, surface Ansible output, dump diagnostics |
| No SSH after reboot (9–10) | The classic incident signature; FAIL, keep VM, capture console (this is the harness succeeding) |
| Assert (11) | FAIL, keep VM, dump post-mortem: `nft list ruleset`, `docker ps`, `ss -tlnp`, `journalctl -b`, `systemd-analyze critical-chain`; exit non-zero |

Diagnostics land in gitignored ~/integration-runs/<ts>-<host>/ (same pattern as ADR-017's screenshot dir; the agent reads them directly).
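A possible shape for the diagnostics plan is sketched below. The post-mortem commands are the ones listed for a failed assert; the timestamp format, the output file names, and the function names are assumptions made for illustration. The sketch only plans which command output goes to which file and does not run anything.

```python
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional

# Post-mortem commands from the spec; output file names are hypothetical.
POST_MORTEM: Dict[str, List[str]] = {
    "nft-ruleset.txt": ["nft", "list", "ruleset"],
    "docker-ps.txt": ["docker", "ps"],
    "listeners.txt": ["ss", "-tlnp"],
    "journal-boot.txt": ["journalctl", "-b"],
    "critical-chain.txt": ["systemd-analyze", "critical-chain"],
}


def run_dir(host: str, now: datetime, root: Optional[Path] = None) -> Path:
    """Build <root>/<ts>-<host>; root defaults to ~/integration-runs."""
    base = root if root is not None else Path.home() / "integration-runs"
    return base / f"{now:%Y%m%dT%H%M%S}-{host}"


def plan_diagnostics(host: str, now: datetime, root: Optional[Path] = None) -> Dict[Path, List[str]]:
    """Map each output file to the command whose output it should hold."""
    target = run_dir(host, now, root)
    return {target / name: cmd for name, cmd in POST_MORTEM.items()}
```

In the real driver each command would presumably run inside the VM over SSH, with the serial console log as the fallback when SSH itself is the thing that broke.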

Three safety invariants (these make the testing tool itself safe):

The transient inventory contains only the test VM — no real host is ever in scope; the apply is --limited to the VM.
"Be askari" points NetBird at the in-VM coordinator (localhost) — the VM forms its own one-node mesh; it never enrolls in the real mesh.
Test VMs sit on an isolated libvirt NAT net: outbound NAT for ACME/image pulls, but no connectivity to the LAN (10.20.x) or the real mesh.
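The first invariant can be enforced in code rather than by convention. The sketch below generates a one-host INI inventory and refuses any address outside the test network; the subnet `192.168.150.0/24`, the group name, the file paths, and the function names are all hypothetical. `-i` and `--limit` are standard `ansible-playbook` options.

```python
import ipaddress
from typing import List

TEST_NET = "192.168.150.0/24"  # hypothetical subnet of the isolated NAT net


def transient_inventory(host: str, lease_ip: str, test_net: str = TEST_NET) -> str:
    """Return an INI inventory containing only the test VM."""
    if ipaddress.ip_address(lease_ip) not in ipaddress.ip_network(test_net):
        raise ValueError(f"{lease_ip} is outside the test network {test_net}")
    return f"[integration]\n{host} ansible_host={lease_ip}\n"


def inventory_hosts(text: str) -> List[str]:
    """List host aliases in a simple INI inventory (no group nesting)."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(("[", "#", ";")):
            hosts.append(line.split()[0])
    return hosts


def apply_argv(inventory_path: str, host: str, playbook: str = "site.yml") -> List[str]:
    """Build the ansible-playbook command line, limited to the test VM."""
    return ["ansible-playbook", "-i", inventory_path, "--limit", host, playbook]
```

Using the mirrored host's own name as the alias is what would let its real host_vars apply, while the subnet check guards against the alias ever resolving to the real machine.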

Resource guard (ubongo's 15 GiB ceiling, ADR-015/012): default VM ≈ 2 vCPU / 3 GiB / 20 GiB thin overlay; the driver refuses to start below a free-RAM threshold and enforces one integration VM at a time (name-prefix boma-it-*). Teardown: success destroys domain + overlay; failure keeps them and prints how to inspect; make test-integration-clean reaps all boma-it-* orphans. An optional post-apply snapshot lets reset re-run reboot+assert without re-applying (fast iteration on a fix).
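The resource-guard math might look like the sketch below. It works on text inputs (the contents of /proc/meminfo and the output of a `virsh list --all --name` call) so it can be tested with mocks. The 3 GiB VM size and the `boma-it-` prefix come from the spec; the 2 GiB headroom and all function names are assumptions.

```python
from typing import List, Tuple

VM_PREFIX = "boma-it-"      # name prefix from the spec
KIB_PER_GIB = 1024 * 1024   # /proc/meminfo reports kB (KiB)


def mem_available_gib(meminfo: str) -> float:
    """Read MemAvailable from /proc/meminfo text and convert to GiB."""
    for line in meminfo.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / KIB_PER_GIB
    raise ValueError("MemAvailable not found")


def existing_test_vms(virsh_names: str) -> List[str]:
    """Filter domain names (one per line) down to integration VMs."""
    names = (line.strip() for line in virsh_names.splitlines())
    return [name for name in names if name.startswith(VM_PREFIX)]


def may_start(
    meminfo: str, virsh_names: str, vm_gib: float = 3.0, headroom_gib: float = 2.0
) -> Tuple[bool, str]:
    """Enforce one VM at a time and a free-RAM floor (headroom is hypothetical)."""
    vms = existing_test_vms(virsh_names)
    if vms:
        return False, f"integration VM already defined: {vms[0]}"
    free = mem_available_gib(meminfo)
    need = vm_gib + headroom_gib
    if free < need:
        return False, f"only {free:.1f} GiB available, need {need:.1f} GiB"
    return True, "ok"
```

Under this sketch a kept failed VM also blocks the next run until it is pruned, which is one reading of "one integration VM at a time".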

Testing the tester

pytest on the driver's pure logic: transient-inventory generation, var/overlay merge, --certs→overlay mapping, DHCP-lease parsing, resource-guard math (mock virsh). Joins boma's existing pytest suite.
Molecule (Docker) on the integration_test role: asserts libvirt/qemu/virtinst installed, libvirtd enabled, users in libvirt group, driver present. (Cannot run KVM-in-Docker — the documented Molecule limitation.)
End-to-end self-test = the acceptance test above, run manually on first build and recorded in the runbook.
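As an example of the pure logic the pytest suite would cover, DHCP-lease parsing could be sketched as follows. The column layout assumed here for `virsh domifaddr --source lease` output (name, MAC address, protocol, address/prefix) is from memory and should be checked against the installed libvirt version; the function name is hypothetical.

```python
import re
from typing import Optional

# One data row: <iface> <mac> ipv4 <address>/<prefix>
_LEASE_ROW = re.compile(
    r"^\s*\S+\s+(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}\s+ipv4\s+"
    r"(\d{1,3}(?:\.\d{1,3}){3})/\d+\s*$"
)


def parse_lease_ip(domifaddr_output: str) -> Optional[str]:
    """Return the first IPv4 address in the output, or None if no lease yet."""
    for line in domifaddr_output.splitlines():
        match = _LEASE_ROW.match(line)
        if match:
            return match.group(1)
    return None
```

Returning None rather than raising lets the "wait for SSH" step poll until the lease appears and apply its own timeout.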

Governance & documentation touch-points

ADR-025 "Local VM integration testing" — decision, approach A, rejected alternatives (Proxmox-nested / Vagrant / TF-libvirt), cert tiers.
ADR-008 — pointer to ADR-025; redirect its "what Molecule does NOT test" gaps (nftables loading, mesh dataplane) to this level.
ADR-015 — one-line reconciliation: "not a hypervisor" → runs ephemeral KVM test VMs as part of its local-test-runner role (still not a production hypervisor); note the test-VM RAM load.
docs/security/accepted-risks.md — the le-prod-wildcard risk (prod Gandi credential → ephemeral VM; transient TXT in real wingu.me).
CLAUDE.md command table + docs/runbooks/integration-testing.md (run a cycle, cert knobs, where diagnostics land, inspecting a kept failed VM, pruning) + STATUS.md entry. The runbook's pre-flight line operationalises FRICTION #6 (validate reboot-recovery before retiring the break-glass).

Capacity

One VM (~3 GiB) against ~13 GiB free is comfortable. The only future pinch is concurrency with the Level-4 Chromium/Playwright stack (ADR-017) — handled by the resource guard + "one at a time." Add a note to docs/hardware/reference.md; revisit at /capacity-review.

Alternatives considered

Proxmox VE nested on ubongo — highest fidelity incl. the provisioning step, but heavy (nested virt, RAM), in tension with ADR-015, and the incident bugs don't live in provisioning. Rejected.
Vagrant + vagrant-libvirt — mature lifecycle/snapshots, but adds the Ruby/Vagrant ecosystem + a fragile plugin, boxes drift from the real Debian cloud image, and the reboot→assert sequence still needs custom logic. Rejected.
terraform-provider-libvirt — declarative and reuses TF, but poor at the imperative apply→reboot→re-apply test sequence, adds throwaway state, and blurs ADR-006's "TF owns production VM existence on Proxmox" boundary. Rejected.

Open questions / deferred

Multi-VM mini-staging (inter-host mesh/dataplane) — design the driver + NAT net so a topology is an additive extension; out of scope for v1.
Interplay with the Level-4 browser stack — both want ubongo RAM; the resource guard is the v1 answer, revisit when Level 4 is built.
Snapshot strategy depth — v1 ships clone-and-destroy + an optional post-apply snapshot; richer snapshot trees deferred.

Knowledge to verify at plan stage (ADR-014)

These are from memory / unverified and must be confirmed against version-matched docs before the plan asserts them:

Exact virt-install --import flags and the cloud-init NoCloud seed format on the Debian-13 libvirt stack.
Whether the Debian-13 genericcloud image ships qemu-guest-agent (IP can come from the DHCP lease regardless — guest-agent is an optimisation, not a requirement).
Let's Encrypt rate limits (prod vs staging) — to confirm "issue the wildcard once, reuse" stays within limits.
The caddy-dns/gandi DNS-01 configuration and pinned version already used by reverse_proxy, and whether the Gandi LiveDNS API key can be scoped to test.wingu.me.
libvirt default vs a dedicated isolated NAT network on Debian-13 (virsh net-*).
